How does PyTorch DataLoader handle variable size data?


I have a dataset that looks like the sample below: the first item on each line is the user ID, followed by the set of items clicked by that user.



0 24104 27359 6684
0 24104 27359
1 16742 31529 31485
1 16742 31529
2 6579 19316 13091 7181 6579 19316 13091
2 6579 19316 13091 7181 6579 19316
2 6579 19316 13091 7181 6579 19316 13091 6579
2 6579 19316 13091 7181 6579
4 19577 21608
4 19577 21608
4 19577 21608 18373
5 3541 9529
5 3541 9529
6 6832 19218 14144
6 6832 19218
7 9751 23424 25067 12606 26245 23083 12606


I define a custom dataset to handle my click log data.



import torch.utils.data as data


class ClickLogDataset(data.Dataset):
    def __init__(self, data_path):
        self.data_path = data_path
        self.uids = []
        self.streams = []

        with open(self.data_path, 'r') as fdata:
            for row in fdata:
                row = row.strip('\n').split('\t')
                self.uids.append(int(row[0]))
                self.streams.append(list(map(int, row[1:])))

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, idx):
        uid, stream = self.uids[idx], self.streams[idx]
        return uid, stream
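
For reference, indexing this dataset directly returns a plain (uid, stream) pair. A quick sketch, assuming the sample data above is saved at data_path:

dataset = ClickLogDataset(data_path)
print(len(dataset))   # 16 lines -> 16 samples
print(dataset[0])     # (0, [24104, 27359, 6684])
print(dataset[15])    # (7, [9751, 23424, 25067, 12606, 26245, 23083, 12606])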


Then I use a DataLoader to retrieve mini-batches from the data for training.



from torch.utils.data.dataloader import DataLoader

clicklog_dataset = ClickLogDataset(data_path)
clicklog_data_loader = DataLoader(dataset=clicklog_dataset, batch_size=16)

for uid_batch, stream_batch in clicklog_data_loader:
    print(uid_batch)
    print(stream_batch)


The code above does not return what I expected. I want stream_batch to be a 2D integer tensor with 16 rows (one per sample in the batch). However, what I get is a list with only one element, a 1D tensor of length 16, as shown below. Why is that?



# stream_batch
[tensor([24104, 24104, 16742, 16742, 6579, 6579, 6579, 6579, 19577, 19577,
         19577, 3541, 3541, 6832, 6832, 9751])]









Tags: python, pytorch, tensor, variable-length






2 Answers






So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter which is used to transform a list of samples into a batch; the default collate function is what turns your variable-length lists into the output you are seeing. You can write your own collate_fn which, for instance, zero-pads the input, truncates it to some predefined length, or applies any other operation of your choice.
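
For instance, a minimal padding collate_fn in this spirit could look like the sketch below. The name pad_collate is made up for illustration; it assumes each sample is the (uid, stream) pair returned by the dataset above and uses torch.nn.utils.rnn.pad_sequence to zero-pad the streams.

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # sketch only: batch is a list of (uid, stream) samples from __getitem__
    uids = torch.tensor([uid for uid, _ in batch], dtype=torch.long)
    streams = [torch.tensor(stream, dtype=torch.long) for _, stream in batch]
    # right-pad every stream with zeros up to the longest stream in this batch
    padded = pad_sequence(streams, batch_first=True, padding_value=0)
    return uids, padded

loader = DataLoader(clicklog_dataset, batch_size=16, collate_fn=pad_collate)

With such a collate_fn, each stream_batch comes back as a single LongTensor of shape (batch_size, longest_stream_in_the_batch) instead of a list of 1D tensors.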






As @Jatentaki suggested, I wrote my own collate function and it works fine.



import collections.abc
import torch


def get_max_length(x):
    # length of the longest sequence in x
    return len(max(x, key=len))


def pad_sequence(seq):
    def _pad(_it, _max_len):
        # left-pad a single sequence with zeros up to _max_len
        return [0] * (_max_len - len(_it)) + _it
    return [_pad(it, get_max_length(seq)) for it in seq]


def custom_collate(batch):
    # batch is a list of (uid, stream) samples; transpose it into
    # (all uids, all streams) and collate each field separately
    transposed = zip(*batch)
    lst = []
    for samples in transposed:
        if isinstance(samples[0], int):
            lst.append(torch.LongTensor(samples))
        elif isinstance(samples[0], float):
            lst.append(torch.DoubleTensor(samples))
        elif isinstance(samples[0], collections.abc.Sequence):
            lst.append(torch.LongTensor(pad_sequence(samples)))
    return lst


# StreamDataset: a dataset returning (uid, stream) samples, like the ClickLogDataset above
stream_dataset = StreamDataset(data_path)
stream_data_loader = torch.utils.data.dataloader.DataLoader(dataset=stream_dataset,
                                                            batch_size=batch_size,
                                                            collate_fn=custom_collate,
                                                            shuffle=False)
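
As a rough usage check (assuming batch_size was set to 16 and the click-log file from the question), stream_batch now comes back as one padded 2D tensor per batch rather than a list of 1D tensors:

for uid_batch, stream_batch in stream_data_loader:
    print(uid_batch.shape)      # e.g. torch.Size([16])
    print(stream_batch.shape)   # e.g. torch.Size([16, <longest stream in this batch>])
    break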




