PySpark 2.2.0 : 'numpy.ndarray' object has no attribute 'indices'

Task



I'm calculating the size of the indices array of a SparseVector using the Python API for Spark (PySpark).



Script



from pyspark.ml.feature import VectorAssembler

def score_clustering(dataframe):
    # assemble every column except the id column into one vector column
    assembler = VectorAssembler(inputCols=dataframe.drop("documento").columns, outputCol="variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables").orderBy(data_transformed.documento.asc()).rdd
    count_variables = data_transformed_rdd.map(lambda row: [row[0], row[1].indices.size]).toDF(["id", "frequency"])
    return count_variables


Issue



When I execute the action .count() on the count_variables dataframe, an error shows up:




AttributeError: 'numpy.ndarray' object has no attribute 'indices'




The main part to consider is:




data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])




I believe this chunk has to do with the error, but I cannot understand why the exception mentions numpy.ndarray when I'm doing the calculation by mapping a lambda expression whose argument is a SparseVector (created with the assembler).
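
A debugging sketch (not part of the original script) that could narrow this down: list the distinct vector classes that actually reach the lambda. Note that DenseVector delegates unknown attribute lookups to its underlying numpy array, which would explain why the message names numpy.ndarray rather than DenseVector.

# Debugging sketch: which Python classes does the "variables" column carry?
print(data_transformed_rdd.map(lambda row: type(row[1]).__name__)
                          .distinct()
                          .collect())
# If this prints e.g. ['SparseVector', 'DenseVector'], the .indices access
# fails for the dense rows.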



Any suggestions? Does anyone know what I'm doing wrong?










python pyspark

asked Mar 7 at 22:03 by David Arango Sampayo (edited Mar 9 at 13:26)
1 Answer
There are two problems here. The first one is in the indices.size call: indices and size are two different attributes of the SparseVector class. size is the complete vector size, while indices holds the positions of the vector whose values are non-zero; size is not an attribute of indices. So, assuming that all your vectors are instances of the SparseVector class:



from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.sparse(4, [], [])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])

df.show()

+---------+--------------------+
|documento|           variables|
+---------+--------------------+
|        0|(4,[0,1],[11.0,2.0])|
|        1|           (4,[],[])|
|        3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+


The solution is to use the len function:



df = df.rdd.map(lambda x: (x[0], x[1], len(x[1].indices))) \
       .toDF(["documento", "variables", "frecuencia"])
df.show()

+---------+--------------------+----------+
|documento|           variables|frecuencia|
+---------+--------------------+----------+
|        0|(4,[0,1],[11.0,2.0])|         2|
|        1|           (4,[],[])|         0|
|        3|(4,[0,1,2],[2.0,2...|         3|
+---------+--------------------+----------+


And here comes the second problem: VectorAssembler does not always generate SparseVectors. Depending on which is more efficient, either SparseVectors or DenseVectors can be generated (based on the number of zeros in the original vector). For example, suppose the following data frame:



df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.dense([1., 1., 1., 1.])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])

df.show()

+---------+--------------------+
|documento|           variables|
+---------+--------------------+
|        0|(4,[0,1],[11.0,2.0])|
|        1|   [1.0,1.0,1.0,1.0]|
|        3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+


Document 1 is a DenseVector, and the previous solution does not work because DenseVector has no indices attribute, so you have to use a more general representation of vectors to work with a DataFrame that contains both sparse and dense vectors, for example numpy:



import numpy as np

df = df.rdd.map(lambda x: (x[0],
                           x[1],
                           np.nonzero(x[1])[0].size)) \
       .toDF(["documento", "variables", "frecuencia"])
df.show()

+---------+--------------------+----------+
|documento|           variables|frecuencia|
+---------+--------------------+----------+
|        0|(4,[0,1],[11.0,2.0])|         2|
|        1|   [1.0,1.0,1.0,1.0]|         4|
|        3|(4,[0,1,2],[2.0,2...|         3|
+---------+--------------------+----------+
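
A related sketch, assuming every row carries a pyspark.ml.linalg vector: both SparseVector and DenseVector expose a numNonzeros() method, so the count can also be written without caring about the concrete vector type:

# Alternative sketch: numNonzeros() is defined on both SparseVector and
# DenseVector, so the sparse/dense distinction disappears.
df = df.rdd.map(lambda x: (x[0], x[1], x[1].numNonzeros())) \
       .toDF(["documento", "variables", "frecuencia"])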





answered Mar 14 at 19:33 by Amanda (edited Mar 14 at 20:50)
• Hello Amanda. Your answer worked perfectly; it solved the issue. I was not aware of what you said about VectorAssembler not always creating SparseVectors. Thanks for the info.

  – David Arango Sampayo
  Mar 21 at 13:35











• VectorAssembler generates SparseVectors or DenseVectors based on the number of zero features in each row. If a row has many zero features, a SparseVector is created; in the opposite case (many non-zero features), a DenseVector is generated, because it is more expensive to create a tuple of the non-zero indices and non-zero values: it could potentially double the memory consumed, with two vectors the same size as the original feature vector.

  – Amanda
  Mar 21 at 15:47
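
To make this behaviour concrete, a minimal sketch (the toy columns a, b, c, d and the demo frame are made up for illustration) that assembles one mostly-zero row and one fully non-zero row:

from pyspark.ml.feature import VectorAssembler

# Toy frame: one mostly-zero row, one row with no zeros at all.
demo = spark.createDataFrame([(0, 7.0, 0.0, 0.0, 0.0),
                              (1, 1.0, 2.0, 3.0, 4.0)],
                             ["documento", "a", "b", "c", "d"])
assembled = VectorAssembler(inputCols=["a", "b", "c", "d"],
                            outputCol="variables").transform(demo)
print(assembled.select("variables").rdd
               .map(lambda row: type(row[0]).__name__).collect())
# Expected: ['SparseVector', 'DenseVector']. The mostly-zero row is
# compressed to a sparse vector; the fully non-zero row stays dense.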










