parquetwrite
Write columnar data to Parquet file
Description
parquetwrite(filename,T,Name,Value) specifies additional options with one or more name-value pair arguments. For example, you can specify 'VariableCompression' to change the compression algorithm used, or 'Version' to write the data to a Parquet 1.0 file.
Examples
Write Table or Timetable to Parquet File
Write tabular data into a Parquet file and compare the size of the same tabular data in .csv and .parquet file formats.
Read the tabular data from the file outages.csv into a table.
T = readtable('outages.csv');
Write the data to Parquet file format. By default, the parquetwrite function uses the Snappy compression scheme. To specify other compression schemes, see the 'VariableCompression' name-value pair.
parquetwrite('outagesDefault.parquet',T)
Get the file sizes and compute the ratio of the size of the tabular data in the .parquet format to the size of the same data in the .csv format.
Get the size of the .csv file.
fcsv = dir(which('outages.csv')); size_csv = fcsv.bytes
size_csv = 101040
Get the size of the .parquet file.
fparquet = dir('outagesDefault.parquet'); size_parquet = fparquet.bytes
size_parquet = 44881
Compute the ratio.
sizeRatio = (size_parquet/size_csv)*100;
disp(['Size Ratio = ', num2str(sizeRatio), '% of original size'])
Size Ratio = 44.419% of original size
Write Nested Data Structure to Parquet File
Create a nested data structure and write it to a Parquet file.
Create a table with one nested layer of data.
Names = ["Akane";"Omar";"Maria"];
NumCourse = [5; 3; 6];
Courses = {["Calculus I";"U.S. History";"English Literature";"Studio Art";"Organic Chemistry II"];
           ["U.S. History";"Art History";"Philosophy"];
           ["Calculus II";"Philosophy II";"Ballet";"Music Theory";"Organic Chemistry I";"English Literature"]};
data = table(Names,NumCourse,Courses)
data=3×3 table
     Names     NumCourse      Courses
    _______    _________    ____________
    "Akane"        5        {5x1 string}
    "Omar"         3        {3x1 string}
    "Maria"        6        {6x1 string}
Write the nested data structure to a Parquet file.
parquetwrite("StudentCourseLoads.parq",data)
Read the nested Parquet data structure.
t2 = parquetread("StudentCourseLoads.parq")
t2=3×3 table
     Names     NumCourse      Courses
    _______    _________    ____________
    "Akane"        5        {5x1 string}
    "Omar"         3        {3x1 string}
    "Maria"        6        {6x1 string}
Input Arguments
filename
—Name of output Parquet file
character vector|string scalar
Name of output Parquet file, specified as a character vector or string scalar.
Depending on the location you are writing to, filename can take on one of these forms.
Location | Form |
---|---|
Current folder | To write to the current folder, specify the name of the file. |
Other folders | To write to a folder different from the current folder, specify the full or relative path name. |
Remote location | For more information on writing to a remote location, see Work with Remote Data. |
Use parquetwrite to export nested cell arrays as LIST arrays. Nested data is beneficial for working with irregularly structured data such as jagged arrays.
Data Types: char | string
T
—Input data
table|timetable
Input data, specified as a table or timetable.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: parquetwrite(filename,T,'VariableCompression','gzip','Version','1.0')
VariableCompression
—Compression scheme names
'snappy' (default) | 'brotli' | 'gzip' | 'uncompressed' | cell array of character vectors | string vector
Compression scheme names, specified as one of these values: 'snappy', 'brotli', 'gzip', or 'uncompressed'. If you specify one compression algorithm, then parquetwrite compresses all variables using the same algorithm. Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.
In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.
Example: parquetwrite('myData.parquet', T, 'VariableCompression', 'brotli')
Example: parquetwrite('myData.parquet', T, 'VariableCompression', {'brotli' 'snappy' 'gzip'})
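As a minimal sketch of the per-variable form, the following writes a small three-variable table with a different compression scheme for each variable and then inspects the stored schemes. The table contents, variable names, and file name are hypothetical and chosen only for illustration. parquetinfo is a separate MATLAB function that is not described on this page.

```matlab
% Hypothetical three-variable table (names and values are illustrative only)
Id = (1:4)';
Label = ["a"; "b"; "c"; "d"];
Flag = [true; false; true; false];
T = table(Id, Label, Flag);

% One compression scheme per variable, listed in variable order
parquetwrite('perVariableCompression.parquet', T, ...
    'VariableCompression', {'brotli', 'snappy', 'gzip'});

% parquetinfo reports the compression scheme stored for each variable
info = parquetinfo('perVariableCompression.parquet');
disp(info.VariableCompression)
```

The list of schemes has one entry per table variable, so a table with three variables takes three scheme names.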
VariableEncoding
—Encoding scheme names
'auto' (default) | 'dictionary' | 'plain' | cell array of character vectors | string vector
Encoding scheme names, specified as one of these values:
- 'auto': parquetwrite uses 'plain' encoding for logical variables, and 'dictionary' encoding for all others.
- 'dictionary', 'plain': If you specify one encoding scheme, then parquetwrite encodes all variables with that scheme.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding scheme to use for each variable.
In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or number of unique values grows to be too big, then the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.
Example: parquetwrite('myData.parquet', T, 'VariableEncoding', 'plain')
Example: parquetwrite('myData.parquet', T, 'VariableEncoding', {'plain' 'dictionary' 'plain'})
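One way to see the effect of the encoding choice is to write the same table with each scheme and compare the resulting file sizes. This is a rough sketch: the table, its variable names, and the file names are hypothetical, and the actual sizes depend on the data and the MATLAB release, so no particular result is claimed.

```matlab
% Hypothetical table: one variable with many repeated values, one without
City = repmat(["Boston"; "Natick"; "Austin"; "Denver"], 2500, 1);
Reading = rand(10000, 1);
T = table(City, Reading);

% Same scheme for every variable
parquetwrite('encDictionary.parquet', T, 'VariableEncoding', 'dictionary');
parquetwrite('encPlain.parquet', T, 'VariableEncoding', 'plain');

% One scheme per variable, listed in variable order
parquetwrite('encMixed.parquet', T, 'VariableEncoding', {'dictionary', 'plain'});

% Compare the sizes of the files on disk
d = dir('encDictionary.parquet');
p = dir('encPlain.parquet');
m = dir('encMixed.parquet');
fprintf('dictionary: %d bytes\nplain: %d bytes\nmixed: %d bytes\n', ...
    d.bytes, p.bytes, m.bytes)
```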
RowGroupHeights
—Number of rows to write per output row group
nonnegative numeric scalar|vector of nonnegative integers
Number of rows to write per output row group, specified as a nonnegative numeric scalar or vector of nonnegative integers.
If you specify a scalar, the scalar value sets the height of all row groups in the output Parquet file. The last row group may contain fewer rows if there is not an exact multiple.
If you specify a vector, each value in the vector sets the height of a corresponding row group in the output Parquet file. The sum of all the values in the vector must match the height of the input table.
A row group is the smallest subset of a Parquet file that can be read into memory at once. Reducing the row group height helps the data fit into memory when reading. Row group height also affects the performance of filtering operations on a Parquet data set because a larger row group height can be used to filter larger amounts of data when reading.
If RowGroupHeights is unspecified and the input table exceeds 67108864 rows, the number of row groups in the output file is equal to floor(TotalNumberOfRows/67108864)+1.
Example: RowGroupHeights=100
Example: RowGroupHeights=[300, 400, 500, 0, 268]
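The scalar and vector forms can be sketched as follows, assuming R2022a or later, where RowGroupHeights is available. The table and file names are hypothetical. parquetinfo is a separate MATLAB function, not described on this page, used here only to inspect the row groups that were written. The last lines work through the default row group formula for a hypothetical table of 200 million rows.

```matlab
% Hypothetical 1000-row table
T = table((1:1000)', rand(1000, 1), 'VariableNames', {'Id', 'Value'});

% Scalar form: every row group has 300 rows, except possibly the last
parquetwrite('rowGroupsScalar.parquet', T, 'RowGroupHeights', 300);
infoScalar = parquetinfo('rowGroupsScalar.parquet');
disp(infoScalar.RowGroupHeights)

% Vector form: the heights must sum to the height of the table
heights = [300, 400, 300];
assert(sum(heights) == height(T))
parquetwrite('rowGroupsVector.parquet', T, 'RowGroupHeights', heights);
infoVector = parquetinfo('rowGroupsVector.parquet');
disp(infoVector.NumRowGroups)

% Default number of row groups for a hypothetical 200,000,000-row table:
% floor(200000000/67108864) + 1 = 2 + 1 = 3
totalRows = 200000000;
defaultGroups = floor(totalRows/67108864) + 1
```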
Version
—Parquet version to use
'2.0' (default) | '1.0'
Parquet version to use, specified as either '1.0' or '2.0'. By default, '2.0' offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.
Caution
Parquet version 1.0 has a limitation that it cannot round-trip variables of type uint32 (they are read back into MATLAB® as int64).
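The caution above can be checked with a short round trip. This is a sketch with a hypothetical one-variable table and hypothetical file names. It writes the same uint32 data with each version and displays the class that parquetread returns.

```matlab
% Hypothetical table with a single uint32 variable
Count = uint32([10; 20; 30]);
T = table(Count);

% Write with Parquet 1.0 and read the data back
parquetwrite('countsV1.parquet', T, 'Version', '1.0');
t1 = parquetread('countsV1.parquet');
disp(class(t1.Count))   % the caution above states that this is int64

% Write with the default version and read the data back for comparison
parquetwrite('countsV2.parquet', T, 'Version', '2.0');
t2 = parquetread('countsV2.parquet');
disp(class(t2.Count))
```

If the original type matters downstream, one option is to cast the variable back after reading, for example t1.Count = uint32(t1.Count).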
Limitations
In some cases, parquetwrite creates files that do not represent the original array T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.
Version History
Introduced in R2019a
R2022a: Determine and define row groups in Parquet file data
A Parquet file can store a range of rows as a distinct row group for increased granularity and targeted analysis. parquetread uses the RowGroups name-value argument to determine row groups while reading Parquet file data. parquetwrite uses the RowGroupHeights name-value argument to define row groups while writing Parquet file data.
R2022a: Export nested data structures
You can now export nested cell arrays as LIST arrays.
R2021b: Read and write datetimes with original time zones
Parquet files require time-zone-aware timestamps to be in the UTC time zone. When writing datetimes, parquetwrite converts them to equivalent UTC values and stores the original time zone values in the metadata of the Parquet file. parquetread uses the stored original time zone values to enable round-tripping.
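A minimal sketch of this round trip, assuming R2021b or later, is shown below. The table, its variable names, the chosen time zone, and the file name are hypothetical.

```matlab
% Hypothetical table with a time-zone-aware datetime variable
Time = datetime(2021, 6, 1, 12, 0, 0, 'TimeZone', 'America/New_York') + hours(0:2)';
Value = [1; 2; 3];
T = table(Time, Value);

% The datetimes are stored as UTC values; the original zone goes in the file metadata
parquetwrite('zonedTimes.parquet', T);

% Read the data back and inspect the restored time zone
T2 = parquetread('zonedTimes.parquet');
disp(T2.Time.TimeZone)

% Check that the datetimes represent the same instants after the round trip
disp(isequal(T.Time, T2.Time))
```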
R2021a: Use categorical data in Parquet data format
Write Parquet data that contains the categorical data type.
R2020a: Control encoding scheme and Parquet version when writing files
The parquetwrite function has two new name-value arguments:
- 'VariableEncoding' controls whether a Parquet file uses plain or dictionary encoding for each variable.
- 'Version' specifies whether to use Parquet 1.0 or Parquet 2.0 file formatting.
R2019b: Write tabular data containing any characters
Write tabular data that has variable names containing any Unicode characters, including spaces and non-ASCII characters. To write tabular data that contains arbitrary variable names, such as variable names with spaces and non-ASCII characters, set the PreserveVariableNames parameter to true.