How does Pytorch Dataloader handle variable size data?


I have a dataset that looks like the sample below. The first item on each line is the user id, followed by the set of items clicked by that user.



0 24104 27359 6684
0 24104 27359
1 16742 31529 31485
1 16742 31529
2 6579 19316 13091 7181 6579 19316 13091
2 6579 19316 13091 7181 6579 19316
2 6579 19316 13091 7181 6579 19316 13091 6579
2 6579 19316 13091 7181 6579
4 19577 21608
4 19577 21608
4 19577 21608 18373
5 3541 9529
5 3541 9529
6 6832 19218 14144
6 6832 19218
7 9751 23424 25067 12606 26245 23083 12606


I define a custom dataset to handle my click log data.



import torch.utils.data as data

class ClickLogDataset(data.Dataset):
    def __init__(self, data_path):
        self.data_path = data_path
        self.uids = []
        self.streams = []

        with open(self.data_path, 'r') as fdata:
            for row in fdata:
                # Each row: user id, then the tab-separated ids of the clicked items.
                row = row.strip('\n').split('\t')
                self.uids.append(int(row[0]))
                self.streams.append(list(map(int, row[1:])))

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, idx):
        # Returns (uid, stream); streams have variable length across samples.
        uid, stream = self.uids[idx], self.streams[idx]
        return uid, stream


Then I use a DataLoader to retrieve mini batches from the data for training.



from torch.utils.data.dataloader import DataLoader

clicklog_dataset = ClickLogDataset(data_path)
clicklog_data_loader = DataLoader(dataset=clicklog_dataset, batch_size=16)

for uid_batch, stream_batch in clicklog_data_loader:
    print(uid_batch)
    print(stream_batch)


The code above does not return what I expect: I want stream_batch to be a 2D integer tensor with 16 rows, one per sample. Instead, I get a list containing a single 1D tensor of length 16, as shown below. Why is that?



#stream_batch
[tensor([24104, 24104, 16742, 16742, 6579, 6579, 6579, 6579, 19577, 19577,
19577, 3541, 3541, 6832, 6832, 9751])]









Tags: python, pytorch, tensor, variable-length

asked Mar 7 at 10:08 by Trung Le
          2 Answers
So how do you handle the fact that your samples are of different length? torch.utils.data.DataLoader has a collate_fn parameter which is used to transform a list of samples into a batch. By default, lists are collated position by position (the batch of lists is effectively zipped), which is why you get a list of tensors instead of a single 2D tensor. You can write your own collate_fn, which for instance 0-pads the input, truncates it to some predefined length, or applies any other operation of your choice.






answered Mar 7 at 10:23 by Jatentaki

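A minimal sketch of such a collate_fn, assuming the streams can simply be right-padded with zeros and reusing clicklog_dataset from the question (the names pad_collate and loader are just for illustration), could lean on torch.nn.utils.rnn.pad_sequence:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (uid, stream) pairs produced by ClickLogDataset.__getitem__.
    uids = torch.LongTensor([uid for uid, _ in batch])
    streams = [torch.LongTensor(stream) for _, stream in batch]
    # Right-pad every stream with 0 up to the longest stream in this batch,
    # yielding a (batch_size, max_len) tensor because batch_first=True.
    streams = pad_sequence(streams, batch_first=True, padding_value=0)
    return uids, streams

loader = DataLoader(dataset=clicklog_dataset, batch_size=16, collate_fn=pad_collate)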
            As @Jatentaki suggested, I wrote my custom collate function and it worked fine.



import collections.abc

import torch

def get_max_length(x):
    return len(max(x, key=len))

def pad_sequence(seq):
    # Left-pad every sequence with zeros up to the length of the longest one.
    def _pad(_it, _max_len):
        return [0] * (_max_len - len(_it)) + _it
    return [_pad(it, get_max_length(seq)) for it in seq]

def custom_collate(batch):
    # batch is a list of (uid, stream) samples; zip(*batch) groups uids and streams.
    transposed = zip(*batch)
    lst = []
    for samples in transposed:
        if isinstance(samples[0], int):
            lst.append(torch.LongTensor(samples))
        elif isinstance(samples[0], float):
            lst.append(torch.DoubleTensor(samples))
        elif isinstance(samples[0], collections.abc.Sequence):
            lst.append(torch.LongTensor(pad_sequence(samples)))
    return lst

stream_dataset = StreamDataset(data_path)  # the click-log dataset defined above
stream_data_loader = torch.utils.data.dataloader.DataLoader(dataset=stream_dataset,
                                                            batch_size=batch_size,
                                                            collate_fn=custom_collate,
                                                            shuffle=False)
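With a collate_fn like this, iterating the loader as in the question should give stream_batch back as a single 2D LongTensor, one row per sample, with shorter streams left-padded with zeros up to the longest stream in the batch, for example:

for uid_batch, stream_batch in stream_data_loader:
    print(uid_batch.shape)     # e.g. torch.Size([16])
    print(stream_batch.shape)  # e.g. torch.Size([16, longest_stream_in_this_batch])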





answered Mar 8 at 8:56 by Trung Le