I am working on a project that has a lot of data. In the process of extracting it from its original bz2 compression I decided to put it all into parquet files due to parquet's availability and ease of use in other languages, as well as it being able to do everything I need of it.

By default pandas and dask output their parquet using snappy for compression. This uses about twice the amount of space as the bz2 files did, but can be read thousands of times faster, so it is much easier for data analysis. I recently became aware of zstandard, which promises smaller sizes but similar read speeds to snappy. I decided to see if it was worth the extra code to use pyarrow rather than pandas to read and package this data in order to save some space on my hard drive. Below is all of the code I used to test this.

First up is the actual test. I first create some text files to log the time spent on each operation. I then open one of my snappy compressed files containing 4,000,000 rows and around 90 columns, and read and import to pandas from the zstd parquet.

```python
import time
import pathlib
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import altair as alt

# Collect the snappy-compressed parquet files (the directory path is elided).
p = pathlib.Path(...)
files = p.glob("RC_*.parquet")
file = []
for item in files:
    file.append(item)
file = file[0]  # test against one of the files
print(file)

# Text files that log the time spent on each operation.
read = open("read_snappy.txt", mode="a")
write = open("write_snappy.txt", mode="a")
readz = open("read_zstd.txt", mode="a")
writez = open("write_zstd.txt", mode="a")

i = 0
while i < 100:
    start_time = time.time()
    # timed read/write of the parquet file goes here
    elapsed_time = (time.time() - start_time) / 60
    test = str(elapsed_time) + "\n"
    read.write(test)
    i += 1
```
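For reference, here is a minimal sketch of the two approaches being compared, the pandas one-liners versus the pyarrow calls, for reading a snappy parquet and re-writing it with zstd, timed the same way as the test above. The input name `RC_sample.parquet` and the output paths are illustrative assumptions, not from the original test.

```python
import time

import pandas as pd
import pyarrow.parquet as pq

SRC = "RC_sample.parquet"  # hypothetical snappy-compressed input file

# --- pandas: one call each way (uses pyarrow under the hood) ---
start = time.time()
df = pd.read_parquet(SRC)
df.to_parquet("zstd_out_pandas.parquet", compression="zstd")
print(f"pandas round trip: {(time.time() - start) / 60:.3f} min")

# --- pyarrow: read to an Arrow Table, convert to pandas only when needed ---
start = time.time()
table = pq.read_table(SRC)
pq.write_table(table, "zstd_out.parquet", compression="zstd")
df = table.to_pandas()  # read and import to pandas
print(f"pyarrow round trip: {(time.time() - start) / 60:.3f} min")
```

The pyarrow route is more code, but it skips the Table-to-DataFrame conversion when all you want is to repack the file with a different compression codec.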