68 datasets found
  1. New York Taxi Data 2009-2016 in Parquet Fomat

    • academictorrents.com
    bittorrent
    Updated Jul 1, 2017
    Cite
    New York Taxi and Limousine Commission (2017). New York Taxi Data 2009-2016 in Parquet Fomat [Dataset]. https://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e
    Explore at:
    bittorrent (35078948106)
    Dataset updated
    Jul 1, 2017
    Dataset provided by
    New York City Taxi and Limousine Commission (http://www.nyc.gov/tlc)
    Authors
    New York Taxi and Limousine Commission
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Area covered
    New York
    Description

    Trip record data from the Taxi and Limousine Commission, covering January 2009 to December 2016, consolidated and brought into a consistent Parquet format by Ravi Shekhar.

  2. Surface Water - Habitat Results

    • catalog.data.gov
    • datasets.ai
    Updated Nov 27, 2024
    + more versions
    Cite
    California State Water Resources Control Board (2024). Surface Water - Habitat Results [Dataset]. https://catalog.data.gov/dataset/surface-water-habitat-results
    Explore at:
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    California State Water Resources Control Board
    Description

    This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
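Because the dataset is split into one resource per year with a shared schema, recombining it is mechanical. A minimal pandas sketch, using synthetic stand-in CSVs (real resource names and columns may differ):

```python
import io
import pandas as pd

# Synthetic stand-ins for two per-year CSV resources; the real files are
# downloaded from the dataset page (csv or parquet bulk zips).
csv_2022 = io.StringIO("StationName,DataQuality\nSite A,Passed\n")
csv_2023 = io.StringIO("StationName,DataQuality\nSite B,Passed\n")

# Since the data is year-split, recombining is a single concat.
combined = pd.concat([pd.read_csv(f) for f in (csv_2022, csv_2023)],
                     ignore_index=True)
print(len(combined))  # 2
```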

  3. example-space-to-dataset-parquet

    • huggingface.co
    Updated Nov 26, 2024
    + more versions
    Cite
    DeepGHS (2024). example-space-to-dataset-parquet [Dataset]. https://huggingface.co/datasets/deepghs/example-space-to-dataset-parquet
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    DeepGHS
    Description

    deepghs/example-space-to-dataset-parquet dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. example-space-to-dataset-parquet

    • huggingface.co
    Updated Apr 14, 2024
    + more versions
    Cite
    HannibalY (2024). example-space-to-dataset-parquet [Dataset]. https://huggingface.co/datasets/ethix/example-space-to-dataset-parquet
    Explore at:
    Dataset updated
    Apr 14, 2024
    Authors
    HannibalY
    Description

    ethix/example-space-to-dataset-parquet dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Datasets of the CIKM resource paper "A Semantically Enriched Mobility...

    • zenodo.org
    zip
    Updated Jun 16, 2025
    Cite
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli (2025). Datasets of the CIKM resource paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions" [Dataset]. http://doi.org/10.5281/zenodo.15658129
    Explore at:
    zip
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).

    The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.

    Input data

    The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:

    • raw_trajectories_[paris|nyc]_matbuilder.parquet: these are the datasets of raw preprocessed trajectories, ready for ingestion by the MAT-Builder system, as outputted by the notebook 5 - Ensure MAT-Builder compatibility.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the sample of some trajectory, and the dataframe has the following columns:
      • traj_id: trajectory identifier;
      • user: user identifier;
      • lat: latitude of a trajectory sample;
      • lon: longitude of a trajectory sample;
      • time: timestamp of a sample;
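A minimal sketch of the raw-trajectory schema above, on synthetic rows (the real file is raw_trajectories_[paris|nyc]_matbuilder.parquet; coordinates here are made up):

```python
import pandas as pd

# Two synthetic samples from one trajectory, following the columns
# described above: traj_id, user, lat, lon, time.
traj = pd.DataFrame({
    "traj_id": [1, 1],
    "user": [42, 42],
    "lat": [48.8566, 48.8570],
    "lon": [2.3522, 2.3530],
    "time": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:05"]),
})

expected = {"traj_id", "user", "lat", "lon", "time"}
assert expected.issubset(traj.columns)
```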

    • pois.parqet: these are the POI datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 6 - Generate dataset POI from OpenStreetMap.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a POI, and the dataframe has the following columns:
      • osmid: POI OSM identifier
      • element_type: POI OSM element type
      • name: POI native name;
      • name:en: POI English name;
      • wikidata: POI WikiData identifier;
      • geometry: geometry associated with the POI;
      • category: POI category.

    • social_[paris|ny].parquet: these are the social media post datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 9 - Prepare social media dataset for MAT-Builder.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a single social media post, and the dataframe has the following columns:
      • tweet_ID: identifier of the post;
      • text: post's text;
      • tweet_created: post's timestamp;
      • uid: identifier of the user who posted.

    • weather_conditions.parquet: these are the weather conditions datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 7 - Meteostat daily data downloader.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the weather conditions recorded on a single day, and the dataframe has the following columns:
      • DATE: date on which the weather observation was recorded;
      • TAVG_C: average temperature in Celsius;
      • DESCRIPTION: weather conditions.

    Output data: the semantically enriched Paris and New York City datasets

    Tabular Representation

    The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:

    • traj_cleaned.parquet: parquet file storing the dataframe containing the raw preprocessed trajectories after applying MAT-Builder's preprocessing step to raw_trajectories_[paris|nyc]_matbuilder.parquet. The dataframe contains the same columns found in raw_trajectories_[paris|nyc]_matbuilder.parquet, except for time, which has been renamed to datetime. The operations performed in MAT-Builder's preprocessing step were:
      • (1) we filtered out trajectories having fewer than 2 samples;
      • (2) we filtered noisy samples inducing velocities above 300 km/h;
      • (3) finally, we compressed the trajectories such that all points within a radius of 20 meters from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point.
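As a toy illustration of step (1), dropping trajectories with fewer than 2 samples reduces to a pandas groupby filter (synthetic data; the real step also performs the speed filtering and compression above):

```python
import pandas as pd

# Trajectory 1 has two samples, trajectory 2 only one.
samples = pd.DataFrame({
    "traj_id": [1, 1, 2],
    "lat": [0.0, 0.1, 5.0],
    "lon": [0.0, 0.1, 5.0],
})

# Keep only trajectories with at least 2 samples.
kept = samples.groupby("traj_id").filter(lambda g: len(g) >= 2)
print(kept["traj_id"].unique())  # [1]
```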

    • stops.parquet: parquet file storing the dataframe containing the stop segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific stop segment from some trajectory. The columns are:
      • datetime, which indicates when a stop segment starts;
      • leaving_datetime, which indicates when a stop segment ends;
      • uid, the trajectory user's identifier;
      • tid, the trajectory's identifier;
      • lat, the stop segment's centroid latitude;
      • lng, the stop segment's centroid longitude.
        NOTE: to uniquely identify a stop segment, you need the triple (stop segment's index in the dataframe, uid, tid).
    • moves.parquet: parquet file storing the dataframe containing the samples associated with the move segments detected from the trajectories by MAT-Builder's segmentation step. Each row in the dataframe represents a specific sample belonging to some move segment of some trajectory. The columns are:
      • datetime, the sample's timestamp;
      • uid, the samples' trajectory user's identifier;
      • tid, the sample's trajectory's identifier;
      • lat, the sample's latitude;
      • lng, the sample's longitude;
      • move_id, the identifier of a move segment.
        NOTE: to uniquely identify a move segment, you need the triple (uid, tid, move_id).
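Since a move segment is uniquely identified by the triple (uid, tid, move_id), grouping on that triple yields one group per segment. A minimal sketch on synthetic rows following the moves.parquet schema above:

```python
import pandas as pd

# Three synthetic move samples: two from move 0, one from move 1.
moves = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 08:05",
                                "2024-01-01 09:00"]),
    "uid": [7, 7, 7],
    "tid": [1, 1, 1],
    "lat": [48.85, 48.86, 48.87],
    "lng": [2.35, 2.36, 2.37],
    "move_id": [0, 0, 1],
})

# One group per (uid, tid, move_id) triple, i.e. one per move segment.
n_segments = moves.groupby(["uid", "tid", "move_id"]).ngroups
print(n_segments)  # 2
```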

    • enriched_occasional.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed occasional and POIs found to be close to their centroids. As such, in this dataframe an occasional stop can appear multiple times, i.e., when there are multiple POIs located near a stop's centroid. The columns found in this dataframe are the same as in stops.parquet, plus two sets of columns.

      The first set of columns concerns a stop's characteristics:
      • stop_id, which represents the unique identifier of a stop segment and corresponds to the index of said stop in stops.parquet;
      • geometry_stop, which is a Shapely Point representing a stop's centroid;
      • geometry, which is the aforementioned Shapely Point plus a 50-meter buffer around it.

    There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:

      • index_poi, which is the index of the associated POI in the pois.parqet file;
      • osmid, which is the identifier given by OpenStreetMap to the POI;
      • name, the POI's name;
      • wikidata, the POI identifier on wikidata;
      • category, the POI's category;
      • geometry_poi, a Shapely (multi)polygon describing the POI's geometry;
      • distance, the distance between the stop segment's centroid and the POI.
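A minimal shapely sketch of these geometry columns, with synthetic planar coordinates (the real 50-meter buffer assumes a projected CRS; the values here are made up):

```python
from shapely.geometry import Point

# geometry_stop: the stop centroid; geometry: centroid plus a 50-unit buffer.
geometry_stop = Point(651000.0, 6861000.0)
geometry = geometry_stop.buffer(50.0)

# distance, as in the enriched tables, is centroid-to-POI distance.
poi = Point(651030.0, 6861040.0)  # a 3-4-5 triangle away from the centroid
distance = geometry_stop.distance(poi)
print(round(distance))  # 50
```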

    • enriched_systematic.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed systematic and POIs found to be close to their centroids. This dataframe has exactly the same characteristics as enriched_occasional.parquet, plus the following columns:
      • systematic_id, the identifier of the cluster of systematic stops a systematic stop belongs to;
      • frequency, the number of systematic stops within a systematic stop's cluster;
      • home, the probability that the systematic stop's cluster represents the home of the associated user;
      • work, the probability that the systematic stop's cluster represents the workplace of the associated user;
      • other,

  6. Surface Water - Habitat Results

    • data.cnra.ca.gov
    • data.ca.gov
    csv, pdf, zip
    Updated Jul 3, 2025
    Cite
    California State Water Resources Control Board (2025). Surface Water - Habitat Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-habitat-results
    Explore at:
    pdf, csv, zip
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

  7. Surface Water - Benthic Macroinvertebrate Results

    • data.cnra.ca.gov
    • data.ca.gov
    csv, pdf, zip
    Updated Jun 3, 2025
    + more versions
    Cite
    California State Water Resources Control Board (2025). Surface Water - Benthic Macroinvertebrate Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-benthic-macroinvertebrate-results
    Explore at:
    pdf, zip, csv
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note bacteria single species concentrations are stored within the chemistry template, whereas abundance bacteria are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample.

    The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

  8. PSYCHE-D: predicting change in depression severity using person-generated...

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf
    Updated Jul 18, 2024
    Cite
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. http://doi.org/10.5281/zenodo.5085146
    Explore at:
    pdf, bin
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay
    Description

    This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

    Dataset description

    Parquet file, with:

    • 35,694 rows
    • 154 columns

    The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
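Since the index encodes two fields in one string, splitting keys back into (participant, month) pairs is a one-liner. A minimal sketch with synthetic keys:

```python
# Keys follow the "[participant]_[month]" pattern described above;
# these particular values are synthetic.
keys = ["34_12", "34_1", "108_3"]

# rsplit on the last underscore separates participant from month.
parsed = [tuple(int(p) for p in k.rsplit("_", 1)) for k in keys]
print(parsed)  # [(34, 12), (34, 1), (108, 3)]
```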

    Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

    File contains aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.

    The DiSCover Project is a 1-year longitudinal study of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

    The data subset used in this work comprises the following:

    • Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
    • Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
    • Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
    • Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity

    From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

    The dataset contains a total of 35,694 rows for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.

  9. DataSeeds.AI-Sample-Dataset-DSD Dataset

    • paperswithcode.com
    Updated Jun 5, 2025
    + more versions
    Cite
    (2025). DataSeeds.AI-Sample-Dataset-DSD Dataset [Dataset]. https://paperswithcode.com/dataset/dataseeds-ai-sample-dataset-dsd
    Explore at:
    Dataset updated
    Jun 5, 2025
    Description

    Dataset Summary

    The DataSeeds.AI Sample Dataset (DSD) is a high-fidelity, human-curated, computer-vision-ready dataset comprising 7,772 peer-ranked, fully annotated photographic images, 350,000+ words of descriptive text, and comprehensive metadata. While the DSD is being released under an open source license, a sister dataset of over 10,000 fully annotated and segmented images is available for immediate commercial licensing, and the broader GuruShots ecosystem contains over 100 million images in its catalog.

    Each image includes multi-tier human annotations and semantic segmentation masks. Generously contributed to the community by the GuruShots photography platform, where users engage in themed competitions, the DSD uniquely captures aesthetic preference signals and high-quality technical metadata (EXIF) across an expansive diversity of photographic styles, camera types, and subject matter. The dataset is optimized for fine-tuning and evaluating multimodal vision-language models, especially in scene description and stylistic comprehension tasks.

    Technical Report: Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

    GitHub Repo (complete weights and code used to evaluate the DSD): https://github.com/DataSeeds-ai/DSD-finetune-blip-llava

    This dataset is ready for commercial/non-commercial use.

    Dataset Structure
    • Size: 7,772 images (7,010 train, 762 validation)
    • Format: Apache Parquet files for metadata, with images in JPG format
    • Total Size: ~4.1GB
    • Languages: English (annotations)
    • Annotation Quality: All annotations were verified through a multi-tier human-in-the-loop process

  10. pashto_speech_20k

    • huggingface.co
    Updated Apr 24, 2025
    + more versions
    Cite
    Hanif Rahman (2025). pashto_speech_20k [Dataset]. https://huggingface.co/datasets/ihanif/pashto_speech_20k
    Explore at:
    Dataset updated
    Apr 24, 2025
    Authors
    Hanif Rahman
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Pashto Synthetic Speech Dataset Parquet (20k)

    This dataset contains 40000 synthetic speech recordings in the Pashto language, with 20000 male voice recordings and 20000 female voice recordings, stored in Parquet format.

      Dataset Information

    • Dataset Size: 20000 sentences
    • Total Recordings: 40000 audio files (20000 male + 20000 female)
    • Audio Format: WAV, 24kHz, 16-bit PCM, embedded directly in Parquet files
    • Dataset Format: Parquet with 500MB shards
    • Sampling Rate: 24kHz
    … See the full description on the dataset page: https://huggingface.co/datasets/ihanif/pashto_speech_20k.
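Embedding audio directly in Parquet means each row carries raw WAV bytes alongside its metadata. A minimal pandas sketch of that layout (the bytes and column names here are placeholders, not the dataset's actual schema):

```python
import pandas as pd

# One synthetic row: sentence text, voice label, and raw audio bytes.
# The "audio" value is a placeholder byte string, not a real WAV file.
rows = pd.DataFrame({
    "sentence": ["a synthetic Pashto sentence"],
    "voice": ["female"],
    "audio": [b"RIFF\x00\x00\x00\x00WAVEfmt "],
})
print(isinstance(rows.loc[0, "audio"], bytes))  # True
```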

  11. CKW Smart Meter Data

    • data.niaid.nih.gov
    Updated Sep 22, 2024
    Cite
    Barahona Garzon, Braulio (2024). CKW Smart Meter Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13304498
    Explore at:
    Dataset updated
    Sep 22, 2024
    Dataset authored and provided by
    Barahona Garzon, Braulio
    Description

    Overview

    The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW has published anonymised and aggregated data from smart meters that measure electricity consumption in the canton of Lucerne. This unique dataset is accessible on the ckw.ch/opendata platform.

    Data set A - anonymised smart meter data

    Data set B - aggregated smart meter data

    Contents of this data set

    This data set contains a small sample of the CKW data set A, sorted per smart meter ID and stored as parquet files named after the id field of the corresponding smart meter's anonymised data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet

    The original CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as gzip-compressed csv files, which are split into one file per calendar month. The columns in the csv files are:

    id: the anonymized counter ID (text)

    timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)

    value_kwh: the consumption in kWh in the time window under consideration (float)

    In this archive, data from:

    | File size | Export date | Period | Filename |
    | --------- | ----------- | ------ | -------- |
    | 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz |
    | 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz |
    | 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz |
    | 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz |
    | 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz |
    | 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz |
    | 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz |
    | 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz |
    | 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz |
    | 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz |
    | 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz |
    | 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz |
    | 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz |
    | 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz |
    | 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz |
    | 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz |
    | 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz |
    | 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz |
    | 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz |
    | 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz |
    | 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz |
    | 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz |
    | 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz |
    | 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz |
    | 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz |
    | 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz |
    | 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz |
    | 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz |
    | 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz |
    | 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz |
    | 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz |
    | 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz |
    | 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz |
    | 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz |
    | 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz |
    | 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz |
    | 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz |
    | 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |

    was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.
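The per-id reorganisation amounts to grouping the three-column monthly rows by meter id. A minimal pandas sketch on synthetic rows in the (id, timestamp, value_kwh) layout, using two meter ids from this archive:

```python
import pandas as pd

# Synthetic readings in the monthly csv layout described above.
df = pd.DataFrame({
    "id": ["027ceb7b8fd77a4b11b3b497e9f0b174"] * 2
          + ["03a4af696ff6a5c049736e9614f18b1b"],
    "timestamp": pd.to_datetime(
        ["2023-01-01T00:00Z", "2023-01-01T00:15Z", "2023-01-01T00:00Z"]),
    "value_kwh": [0.12, 0.10, 0.40],
})

# One output file per smart meter id ("<id>.parquet"); here we only
# materialise the per-id groups rather than writing files.
per_meter = {mid: g.reset_index(drop=True) for mid, g in df.groupby("id")}
print(len(per_meter))  # 2
```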

    A small sample of all the smart meter data above is archived in the public cloud space of the AISOP project at https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip and also in this public record. For access to the complete data, contact the authors of this archive.

    It consists of the following parquet files:

    | Size | Date | Name |
    |------|------|------|
    | 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |
    | 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |
    | 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |
    | 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |
    | 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |
    | 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |
    | 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |
    | 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |
    | 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |
    | 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |
    | 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |
    | 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |
    | 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |
    | 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |
    | 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |
    | 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |
    | 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |
    | 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |
    | 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |
    | 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |
    | 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |
    | 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |
    | 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |
    | 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |
    | 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |
    | 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |
    | 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |
    | 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |
    | 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |
    | 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |
    | 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |
    | 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |
    | 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |
    | 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |
    | 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |
    | 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |
    | 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |
    | 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |
    | 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |
    | 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |
    | 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |
    | 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |
    | 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |
    | 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |
    | 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |
    | 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |
    | 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |
    | 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |
    | 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |
    | 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |
    | 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |
    | 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |
    | 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |
    | 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |
    | 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |
    | 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |
    | 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |
    | 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |
    | 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |
    | 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |
    | 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |
    | 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |
    | 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |
    | 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |
    | 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |

    where each file contains data from a single smart meter.

    Acknowledgement

    The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from the European Union's Horizon 2020 research and innovation programme under grant agreement No 883973 (ERA-Net Smart Energy Systems joint call on digital transformation for green energy transition).

  12. EEL mouse sagittal atlas 168 genes spatial RNA data

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Lars Borm (2023). EEL mouse sagittal atlas 168 genes spatial RNA data [Dataset]. http://doi.org/10.6084/m9.figshare.20324814.v4
    Explore at:
    txt (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lars Borm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw RNA locations of the mouse atlas produced by EEL FISH for 168 genes.

    RNA files are in the .parquet format which can be opened with FISHscale (https://github.com/linnarsson-lab/FISHscale) or any other parquet file reader (https://arrow.apache.org/docs/index.html)

    RNA .parquet files Seven sagittal sections of the mouse brain with 168 detected genes, sampled at the medial-lateral positions of -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm and 3600 µm measured from the midline. Each file contains the position and gene label for all RNA molecules. "c_px_microscope_stitched" contains X coordinates. "r_px_microscope_stitched" contains Y coordinates. The units are pixels with a size of 0.18 micrometer; multiply by 0.18 to convert to µm. "Valid" is a Boolean column where 1 means the molecule was detected inside the tissue section and 0 means it was detected outside.
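    As a sketch of working with that layout (the toy values and gene labels below are made up; only the column names and the 0.18 µm pixel size come from the description), the validity filter and pixel-to-micrometer conversion look like this in pandas:

```python
import pandas as pd

PX_TO_UM = 0.18  # pixel size in micrometers, per the dataset description

# Toy stand-in for one section's .parquet file
# In practice: spots = pd.read_parquet("<section>.parquet")
spots = pd.DataFrame({
    "c_px_microscope_stitched": [100.0, 200.0, 300.0],  # X, in pixels
    "r_px_microscope_stitched": [50.0, 60.0, 70.0],     # Y, in pixels
    "Valid": [1, 1, 0],
    "gene": ["Slc17a7", "Gad1", "Plp1"],
})

# Keep only molecules detected inside the tissue section
spots = spots[spots["Valid"] == 1]

# Convert stitched pixel coordinates to micrometers
spots["x_um"] = spots["c_px_microscope_stitched"] * PX_TO_UM
spots["y_um"] = spots["r_px_microscope_stitched"] * PX_TO_UM
print(spots[["gene", "x_um", "y_um"]])
```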

    Tissue polygons .csv files CSV files demarcating the sample borders for the 7 mouse atlas sections: -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm, 3600 µm. These polygons were used to generate the "Valid" column. To make your own selection, see the code in: https://github.com/linnarsson-lab/FISHscale/blob/master/FISHscale/utils/inside_polygon.py

    Gene colors .pkl file Pickled Python dictionary with gene colors used in the paper for the mouse atlas.

  13. Surface Water - Aquatic Organism Tissue Sample Results

    • data.cnra.ca.gov
    • data.ca.gov
    csv, pdf, zip
    Updated Jul 3, 2025
    Cite
    California State Water Resources Control Board (2025). Surface Water - Aquatic Organism Tissue Sample Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-aquatic-organism-tissue-sample-results
    Explore at:
    pdf, csv, zip (available download formats)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples, where tissue from multiple organisms is combined and then analyzed. Both individual and composite sample results may be reported, so an individually analyzed organism can appear twice: once as an individual sample and once as a composite whose number of organisms per composite is one.

    The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
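    The CNRA portal is CKAN-based, so the per-year API mentioned above is presumably the standard CKAN datastore_search endpoint. The sketch below builds such a query; the resource id is a hypothetical placeholder, and the actual network fetch is left commented out:

```python
from urllib.parse import urlencode

BASE = "https://data.cnra.ca.gov/api/3/action/datastore_search"

def build_query(resource_id, limit=100, q=None):
    """Build a CKAN datastore_search URL for one year's resource."""
    params = {"resource_id": resource_id, "limit": limit}
    if q is not None:
        params["q"] = q  # optional full-text filter
    return BASE + "?" + urlencode(params)

# Hypothetical resource id standing in for one year's dataset
url = build_query("00000000-0000-0000-0000-000000000000", limit=5)
# import json, urllib.request
# records = json.loads(urllib.request.urlopen(url).read())["result"]["records"]
print(url)
```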

  14. Avery Library Historic Preservation and Urban Planning web archive...

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Ruest, Nick; Sala, Christine; Thurman, Alex (2023). Avery Library Historic Preservation and Urban Planning web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/Z68EVJ
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Ruest, Nick; Sala, Christine; Thurman, Alex
    Description

    Web archive derivatives of the Avery Library Historic Preservation and Urban Planning collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-1757-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    • Domains: .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc) produces a DataFrame with the following columns: domain, count
    • Web Pages: .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content")) produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content
    • Web Graph: .webgraph() produces a DataFrame with the following columns: crawl_date, src, dest, anchor
    • Image Links: .imageLinks() produces a DataFrame with the following columns: src, image_url

    The cul-1757-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud:

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. Each website within the web archive collection has its full text presented on one line, along with information about when it was crawled, the name of the domain, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.

    Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1757-parquet.tar.gz.part* > cul-1757-parquet.tar.gz
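    Once the parts are rejoined and unpacked, the Parquet tables load directly into pandas. In the sketch below, a toy frame stands in for the Web Pages derivative (the real paths are not shown here), and the Domains count is recreated from it:

```python
from urllib.parse import urlparse

import pandas as pd

# In practice, after rejoining and unpacking the archive:
#   pages = pd.read_parquet("cul-1757-parquet/webpages")
# Toy stand-in with the Web Pages derivative's columns:
pages = pd.DataFrame({
    "crawl_date": ["20190101", "20190101", "20190202"],
    "url": [
        "https://www.columbia.edu/about",
        "https://www.columbia.edu/news",
        "https://library.columbia.edu/",
    ],
    "mime_type_web_server": ["text/html"] * 3,
    "mime_type_tika": ["text/html"] * 3,
    "content": ["...", "...", "..."],
})

# Recreate the Domains derivative: page count per domain
domains = (
    pages.assign(domain=pages["url"].map(lambda u: urlparse(u).netloc))
         .groupby("domain").size()
         .sort_values(ascending=False)
         .rename("count")
         .reset_index()
)
print(domains)
```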

  15. Feature Engineering Data

    • kaggle.com
    Updated Jul 23, 2019
    Cite
    Mat Leonard (2019). Feature Engineering Data [Dataset]. https://www.kaggle.com/matleonard/feature-engineering-data/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 23, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mat Leonard
    Description

    This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.

    For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.

    File descriptions

    train_sample.csv - Sampled data

    Data fields

    Each row of the training data contains a click record, with the following features.

    • ip: ip address of click.
    • app: app id for marketing.
    • device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
    • os: os version id of user mobile phone
    • channel: channel id of mobile ad publisher
    • click_time: timestamp of click (UTC)
    • attributed_time: if the user downloaded the app after clicking an ad, this is the time of the app download
    • is_attributed: the target to be predicted, indicating whether the app was downloaded

    Note that ip, app, device, os, and channel are encoded.

    I'm also including Parquet files with various features for use within the course.
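    The sampling described above (keep every positive, discard 99% of the negatives) can be sketched with pandas; the toy frame below stands in for the full TalkingData click log:

```python
import pandas as pd

# In practice: clicks = pd.read_csv("train.csv")  # full TalkingData click log
# Toy stand-in with the target column only:
clicks = pd.DataFrame({"is_attributed": [1] * 5 + [0] * 995})

pos = clicks[clicks["is_attributed"] == 1]                    # keep all positives
neg = clicks[clicks["is_attributed"] == 0].sample(frac=0.01,  # keep 1% of negatives
                                                  random_state=0)
sample = pd.concat([pos, neg]).sample(frac=1, random_state=0)  # shuffle rows
print(sample["is_attributed"].mean())
```

    Downsampling only the majority class keeps every rare positive while shrinking the data, at the cost of shifting the class prior (here from ~0.5% to ~33%).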

  16. California State Water Resources Control Board - Surface Water - Aquatic...

    • gimi9.com
    Updated Feb 18, 2021
    + more versions
    Cite
    (2021). California State Water Resources Control Board - Surface Water - Aquatic Organism Tissue Sample Results | gimi9.com [Dataset]. https://gimi9.com/dataset/california_surface-water-aquatic-organism-tissue-sample-results/
    Explore at:
    Dataset updated
    Feb 18, 2021
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    California
    Description

    This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples, where tissue from multiple organisms is combined and then analyzed. Both individual and composite sample results may be reported, so an individually analyzed organism can appear twice: once as an individual sample and once as a composite whose number of organisms per composite is one. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool

  17. Surface Water - Aquatic Organism Tissue Sample Results

    • catalog.data.gov
    Updated Nov 27, 2024
    Cite
    California State Water Resources Control Board (2024). Surface Water - Aquatic Organism Tissue Sample Results [Dataset]. https://catalog.data.gov/dataset/surface-water-aquatic-organism-tissue-sample-results
    Explore at:
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    California State Water Resources Control Board
    Description

    This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples, where tissue from multiple organisms is combined and then analyzed. Both individual and composite sample results may be reported, so an individually analyzed organism can appear twice: once as an individual sample and once as a composite whose number of organisms per composite is one. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool

  18. Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...

    • data.usgs.gov
    • gimi9.com
    • +1more
    Updated Mar 3, 2025
    + more versions
    Cite
    Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. http://doi.org/10.5066/P13DZ7MP
    Explore at:
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Mar 1, 2013 - Dec 31, 2023
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files. Limitations of this dataset include:

    • All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    • Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each dat ...

  19. PUBG - Match Data

    • data.niaid.nih.gov
    Updated Dec 10, 2023
    Cite
    Joshy, Vivek (2023). PUBG - Match Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10341919
    Explore at:
    Dataset updated
    Dec 10, 2023
    Dataset authored and provided by
    Joshy, Vivek
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset was generated from skihikingkevin/pubg-match-deaths. It only consists of matches where at least one player has played more than one game (in different matches). The data was processed using polars and converted from CSV to Parquet files. A random sample was performed (groupwise) to produce an 80/20 split.

  20. University Archives web archive collection derivatives

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    + more versions
    Cite
    Ruest, Nick; Wilk, Jocelyn; Thurman, Alex (2023). University Archives web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/FONRZU
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Ruest, Nick; Wilk, Jocelyn; Thurman, Alex
    Description

    Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    • Domains: .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc) produces a DataFrame with the following columns: domain, count
    • Web Pages: .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content")) produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content
    • Web Graph: .webgraph() produces a DataFrame with the following columns: crawl_date, src, dest, anchor
    • Image Links: .imageLinks() produces a DataFrame with the following columns: src, image_url
    • Binary Analysis: Images, PDFs, Presentation program files, Spreadsheets, Text files, Word processor files

    The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud:

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. Each website within the web archive collection has its full text presented on one line, along with information about when it was crawled, the name of the domain, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.

    Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz
