• Empleos
  • Sobre nosotros
  • profesionales
    • Inicio
    • Empleos
    • Cursos y retos
  • empresas
    • Inicio
    • Publicar vacante
    • Nuestro proceso
    • Precios
    • Evaluaciones
    • Nómina
    • Blog
    • Comercial
    • Calculadora de salario

0

476
Vistas
How do I read a parquet in PySpark written from Spark?

I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")

I then go to my Python notebook to read in the data:

df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'

I have looked at the spark documentation and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something else when I save/load? The data is landing in Object Storage.

edit: I'm sing spark 2.0 in both the read and the write.

edit2: This was done in a project in Data Science Experience.

over 3 years ago · Santiago Trujillo
2 Respuestas
Responde la pregunta

0

I read parquet file in the following way:

from pyspark.sql import SparkSession
# initialise sparkContext
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

sc = spark.sparkContext

# using SQLContext to read parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# to read parquet file
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
over 3 years ago · Santiago Trujillo Denunciar

0

You can use parquet format of Spark Session to read parquet files. Like this:

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

Although, there is no difference between parquet and load functions. It might be the case that load is not able to infer the schema of data in the file (eg, some data type which is not identifiable by load or specific to parquet).

over 3 years ago · Santiago Trujillo Denunciar
Responde la pregunta
Encuentra empleos remotos

¡Descubre la nueva forma de encontrar empleo!

Top de empleos
Top categorías de empleo
Empresas
Publicar vacante Precios Nuestro proceso Comercial
Legal
Términos y condiciones Política de privacidad
© 2025 PeakU Inc. All Rights Reserved.

Andres GPT

Recomiéndame algunas ofertas
Necesito ayuda