parquetwrite

Write columnar data to Parquet file

Syntax

parquetwrite(filename,T)

parquetwrite(filename,T,Name,Value)

Description

parquetwrite(filename,T) writes a table or timetable T to a Parquet 2.0 file with the filename specified in filename.

parquetwrite(filename,T,Name,Value) specifies additional options with one or more name-value pair arguments. For example, you can specify 'VariableCompression' to change the compression algorithm used, or 'Version' to write the data to a Parquet 1.0 file.

Examples

Write Table or Timetable to Parquet File

Write tabular data into a Parquet file and compare the size of the same tabular data in .csv and .parquet file formats.

Read the tabular data from the file outages.csv into a table.

T = readtable('outages.csv');

Write the data to Parquet file format. By default, the parquetwrite function uses the Snappy compression scheme. To specify other compression schemes, see the 'VariableCompression' name-value pair.

parquetwrite('outagesDefault.parquet',T)

Get the file sizes and compute the ratio of the size of the tabular data in the .csv format to the size of the same data in the .parquet format.

Get the size of the .csv file.

fcsv = dir(which('outages.csv')); size_csv = fcsv.bytes

size_csv = 101040

Get the size of the .parquet file.

fparquet = dir('outagesDefault.parquet'); size_parquet = fparquet.bytes

size_parquet = 44881

Compute the ratio.

sizeRatio = (size_parquet/size_csv)*100;
disp(['Size Ratio = ', num2str(sizeRatio), '% of original size'])

Size Ratio = 44.419% of original size

Write Nested Data Structure to Parquet File

Create a nested data structure and write it to a Parquet file.

Create a table with one nested layer of data.

Names = ["Akane";"Omar";"Maria"];
NumCourse = [5; 3; 6];
Courses = {["Calculus I";"U.S. History";"English Literature";"Studio Art";"Organic Chemistry II"]; ...
           ["U.S. History";"Art History";"Philosophy"]; ...
           ["Calculus II";"Philosophy II";"Ballet";"Music Theory";"Organic Chemistry I";"English Literature"]};
data = table(Names,NumCourse,Courses)

data=3×3 table
     Names     NumCourse      Courses
    _______    _________    ____________

    "Akane"        5        {5x1 string}
    "Omar"         3        {3x1 string}
    "Maria"        6        {6x1 string}

Write the nested data structure to a Parquet file.

parquetwrite("StudentCourseLoads.parq",data)

Read the nested Parquet data structure.

t2 = parquetread("StudentCourseLoads.parq")

t2=3×3 table
     Names     NumCourse      Courses
    _______    _________    ____________

    "Akane"        5        {5x1 string}
    "Omar"         3        {3x1 string}
    "Maria"        6        {6x1 string}

Input Arguments

`filename` - Name of output Parquet file
character vector | string scalar

Name of output Parquet file, specified as a character vector or string scalar.

Depending on the location you are writing to, filename can take on one of these forms.

Location

Form

Current folder

To write to the current folder, specify the name of the file in filename.

Example: 'myData.parquet'

Other folders

To write to a folder different from the current folder, specify the full or relative path name in filename.

Example: 'C:\myFolder\myData.parquet'

Example: 'dataDir\myData.parquet'

Remote Location

To write to a remote location, filename must contain the full path of the file specified as a uniform resource locator (URL) of the form:

scheme_name://path_to_file/myData.parquet

Based on the remote location, scheme_name can be one of the values in this table.

Remote Location	`scheme_name`
Amazon S3™	`s3`
Windows Azure® Blob Storage	`wasb`, `wasbs`
HDFS™	`hdfs`

For more information, see Work with Remote Data.

Example: 's3://bucketname/path_to_file/myData.parquet'

Use parquetwrite to export nested cell arrays as LIST arrays. Nested data is beneficial for working with irregularly structured data such as jagged arrays.

Data Types: char | string

`T` - Input data
table | timetable

Input data, specified as a table or timetable.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: parquetwrite(filename,T,'VariableCompression','gzip','Version','1.0')

`VariableCompression` - Compression scheme names
`'snappy'` (default) | `'brotli'` | `'gzip'` | `'uncompressed'` | cell array of character vectors | string vector

Compression scheme names, specified as one of these values:

'snappy', 'brotli', 'gzip', or 'uncompressed'. If you specify one compression algorithm, then parquetwrite compresses all variables using the same algorithm.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.

In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.

Example: parquetwrite('myData.parquet', T, 'VariableCompression', 'brotli')

Example: parquetwrite('myData.parquet', T, 'VariableCompression', {'brotli' 'snappy' 'gzip'})
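The trade-offs described above can be checked on your own data. The following is a minimal sketch, not a benchmark: it writes the outages.csv sample table once per compression scheme and prints the resulting file sizes. The output file names and the use of tempdir are assumptions made for illustration, and actual sizes depend on the data and the MATLAB release.

```matlab
% Sketch: compare Parquet file sizes across the supported compression schemes.
T = readtable('outages.csv');                      % sample file used in the example above
schemes = {'snappy','brotli','gzip','uncompressed'};

for k = 1:numel(schemes)
    % Hypothetical output location; any writable folder works.
    fname = fullfile(tempdir, ['outages_' schemes{k} '.parquet']);
    parquetwrite(fname, T, 'VariableCompression', schemes{k});

    % Report the on-disk size of the file just written.
    info = dir(fname);
    fprintf('%-12s %8d bytes\n', schemes{k}, info.bytes);
end
```

If write or read speed matters more than file size, the same loop can be timed with tic and toc.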

`VariableEncoding` - Encoding scheme names
`'auto'` (default) | `'dictionary'` | `'plain'` | cell array of character vectors | string vector

Encoding scheme names, specified as one of these values:

'auto' - parquetwrite uses 'plain' encoding for logical variables and 'dictionary' encoding for all others.
'dictionary', 'plain' - If you specify one encoding scheme, then parquetwrite encodes all variables with that scheme.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding schemes to use for each variable.

In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or the number of unique values grows too large, then the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', 'plain')

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', {'plain' 'dictionary' 'plain'})
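As an illustration of per-variable encoding, the sketch below builds a small three-variable table and assigns one encoding scheme to each variable. The table contents and the file name are hypothetical. It also assumes that the object returned by parquetinfo exposes a VariableEncoding property, so treat that line as an assumption to confirm in your release.

```matlab
% Sketch: choose an encoding scheme for each variable individually.
ID   = (1:6)';                                      % mostly unique values
Flag = logical([1;0;1;1;0;0]);                      % logical variable
City = ["Boston";"Boston";"Natick";"Natick";"Boston";"Natick"];  % many repeats
T = table(ID, Flag, City);

fname = fullfile(tempdir, 'encodingSketch.parquet');  % hypothetical file name

% One entry per table variable, in variable order.
parquetwrite(fname, T, 'VariableEncoding', {'plain','plain','dictionary'});

% Assumption: parquetinfo reports the encoding recorded for each variable.
info = parquetinfo(fname);
disp(info.VariableEncoding)
```

Dictionary encoding is chosen for City here because it contains only two distinct values, which is the case the text above describes as producing smaller files.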

`RowGroupHeights` - Number of rows to write per output row group
nonnegative numeric scalar | vector of nonnegative integers

Number of rows to write per output row group, specified as a nonnegative numeric scalar or vector of nonnegative integers.

If you specify a scalar, the scalar value sets the height of all row groups in the output Parquet file. The last row group may contain fewer rows if there is not an exact multiple.
If you specify a vector, each value in the vector sets the height of a corresponding row group in the output Parquet file. The sum of all the values in the vector must match the height of the input table.

A row group is the smallest subset of a Parquet file that can be read into memory at once. Reducing the row group height helps the data fit into memory when reading. Row group height also affects the performance of filtering operations on a Parquet data set because a larger row group height can be used to filter larger amounts of data when reading.

If RowGroupHeights is unspecified and the input table exceeds 67108864 rows, the number of row groups in the output file is equal to floor(TotalNumberOfRows/67108864)+1.

Example: RowGroupHeights=100

Example: RowGroupHeights=[300, 400, 500, 0, 268]
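The scalar and vector forms, and the default row group count, can be sketched as follows. This assumes R2022a or later, and assumes that parquetinfo reports NumRowGroups and RowGroupHeights for the written file. The table, the file name, and the 150-million-row figure are hypothetical values chosen for illustration.

```matlab
% Sketch: control row group layout when writing.
T = table((1:1468)', 'VariableNames', {'Value'});     % hypothetical 1468-row table
fname = fullfile(tempdir, 'rowGroupSketch.parquet');  % hypothetical file name

% Scalar form: every row group holds 500 rows, except possibly the last.
parquetwrite(fname, T, 'RowGroupHeights', 500);
info = parquetinfo(fname);                            % assumed to expose the properties below
disp(info.NumRowGroups)
disp(info.RowGroupHeights)

% Vector form: the heights must sum to height(T); 300+400+500+0+268 = 1468.
heights = [300, 400, 500, 0, 268];
parquetwrite(fname, T, 'RowGroupHeights', heights);

% Default row group count for a hypothetical 150-million-row table:
% floor(150000000/67108864) is 2, so the formula gives 3.
totalRows = 150e6;
defaultGroups = floor(totalRows/67108864) + 1;
disp(defaultGroups)
```

Smaller row groups make it easier to read the file in pieces that fit in memory, at the cost of more row groups to manage.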

`Version` - Parquet version to use
`'2.0'` (default) | `'1.0'`

Parquet version to use, specified as either '1.0' or '2.0'. By default, '2.0' offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.

Caution

Parquet version 1.0 cannot round-trip variables of type uint32 (they are read back into MATLAB® as int64).
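One way to see the effect of this caution is to write the same uint32 variable with both versions and inspect the class after reading it back. This is a minimal sketch with a hypothetical table and file names. According to the caution above, the version 1.0 file is expected to return int64, but the classes you observe should be confirmed in your own release.

```matlab
% Sketch: check how a uint32 variable round-trips under each Parquet version.
T = table(uint32([1;2;3]), 'VariableNames', {'Count'});   % hypothetical data

f1 = fullfile(tempdir, 'versionSketch_v1.parquet');       % hypothetical file names
f2 = fullfile(tempdir, 'versionSketch_v2.parquet');

parquetwrite(f1, T, 'Version', '1.0');
parquetwrite(f2, T, 'Version', '2.0');

R1 = parquetread(f1);
R2 = parquetread(f2);

% Compare the class of the variable after reading each file back.
fprintf('Version 1.0 reads Count as: %s\n', class(R1.Count));
fprintf('Version 2.0 reads Count as: %s\n', class(R2.Count));
```

If downstream code depends on the integer class, cast the variable explicitly after reading, for example with uint32(R1.Count).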

Limitations

In some cases, parquetwrite creates files that do not represent the original array T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.

Version History

Introduced in R2019a

R2022a: Determine and define row groups in Parquet file data

A Parquet file can store a range of rows as a distinct row group for increased granularity and targeted analysis. parquetread uses the RowGroups name-value argument to determine row groups while reading Parquet file data. parquetwrite uses the RowGroupHeights name-value argument to define row groups while writing Parquet file data.

R2022a: Export nested data structures

You can now export nested cell arrays as LIST arrays.

R2021b: Read and write datetimes with original time zones

Parquet files require time-zone-aware timestamps to be in the UTC time zone. When writing datetimes, parquetwrite converts them to equivalent UTC values and stores the original time zone values in the metadata of the Parquet file. parquetread uses the stored original time zone values to enable round-tripping.
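The round trip described above can be sketched as follows, assuming R2021b or later. The table contents and the file name are hypothetical. The comparison at the end checks that the same instant in time is recovered, which holds regardless of the time zone used for display.

```matlab
% Sketch: write a zoned datetime and read it back.
t = datetime(2021,6,1,9,0,0, 'TimeZone', 'America/New_York');
T = table(t, 42, 'VariableNames', {'Time','Value'});      % hypothetical data

fname = fullfile(tempdir, 'timezoneSketch.parquet');      % hypothetical file name
parquetwrite(fname, T);

R = parquetread(fname);

% Inspect the time zone restored from the file metadata.
disp(R.Time.TimeZone)

% Zoned datetimes compare by instant, so this is true when the value round-trips.
sameInstant = (R.Time == T.Time);
disp(sameInstant)
```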

R2021a: Use categorical data in Parquet data format

Write Parquet data that contains the categorical data type.

R2020a: Control encoding scheme and Parquet version when writing files

The parquetwrite function has two new name-value arguments:

'VariableEncoding' controls whether a Parquet file uses plain or dictionary encoding for each variable.
'Version' specifies whether to use Parquet 1.0 or Parquet 2.0 file formatting.

R2019b: Write tabular data containing any characters

Write tabular data that has variable names containing any Unicode characters, including spaces and non-ASCII characters. To write tabular data that contains arbitrary variable names, such as variable names with spaces and non-ASCII characters, set the PreserveVariableNames parameter to true.