75 datasets found
  1. New York Taxi Data 2009-2016 in Parquet Format

    • academictorrents.com
    bittorrent
    Updated Jul 1, 2017
    Cite
    New York Taxi and Limousine Commission (2017). New York Taxi Data 2009-2016 in Parquet Format [Dataset]. https://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e
    Explore at:
    bittorrent (35078948106)
    Dataset updated
    Jul 1, 2017
    Dataset provided by
    New York City Taxi and Limousine Commission (http://www.nyc.gov/tlc)
    Authors
    New York Taxi and Limousine Commission
    License

    No license specified: https://academictorrents.com/nolicensespecified

    Area covered
    New York
    Description

    Trip record data from the Taxi and Limousine Commission (TLC) from January 2009 to December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar.

  2. Surface Water - Habitat Results

    • catalog.data.gov
    • datasets.ai
    Updated Nov 27, 2024
    + more versions
    Cite
    California State Water Resources Control Board (2024). Surface Water - Habitat Results [Dataset]. https://catalog.data.gov/dataset/surface-water-habitat-results
    Explore at:
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    California State Water Resources Control Board
    Description

    This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here.

    Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
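    A minimal sketch of loading one of the yearly bulk parquet resources with pandas, assuming the file has been downloaded locally (the file name below is hypothetical) and that the column names match the field names quoted above:

    ```python
    # Hedged sketch: inspect one year of CEDEN habitat results from a local parquet file.
    import pandas as pd

    habitat = pd.read_parquet("ceden_habitat_2023.parquet")  # hypothetical local file name
    print(habitat.shape)
    # Provisional quality flag named in the description above.
    print(habitat["DataQuality"].value_counts())
    ```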

  3. Datasets of the CIKM resource paper "A Semantically Enriched Mobility...

    • zenodo.org
    zip
    Updated Jun 16, 2025
    Cite
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli (2025). Datasets of the CIKM resource paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions" [Dataset]. http://doi.org/10.5281/zenodo.15658129
    Explore at:
    zip
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Lettich; Chiara Pugliese; Guido Rocchietti; Chiara Renso; Fabio Pinelli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).

    The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.

    Input data

    The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:

    • raw_trajectories_[paris|nyc]_matbuilder.parquet: these are the datasets of raw preprocessed trajectories, ready for ingestion by the MAT-Builder system, as outputted by the notebook 5 - Ensure MAT-Builder compatibility.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a sample of some trajectory, and the dataframe has the following columns (a short loading sketch follows this list):
      • traj_id: trajectory identifier;
      • user: user identifier;
      • lat: latitude of a trajectory sample;
      • lon: longitude of a trajectory sample;
      • time: timestamp of a sample;

    • pois.parqet: these are the POI datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 6 - Generate dataset POI from OpenStreetMap.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a POI, and the dataframe has the following columns:
      • osmid: POI OSM identifier
      • element_type: POI OSM element type
      • name: POI native name;
      • name:en: POI English name;
      • wikidata: POI WikiData identifier;
      • geometry: geometry associated with the POI;
      • category: POI category.

    • social_[paris|ny].parquet: these are the social media post datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 9 - Prepare social media dataset for MAT-Builder.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a single social media post, and the dataframe has the following columns:
      • tweet_ID: identifier of the post;
      • text: post's text;
      • tweet_created: post's timestamp;
      • uid: identifier of the user who posted.

    • weather_conditions.parquet: these are the weather conditions datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 7 - Meteostat daily data downloader.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the weather conditions recorded in a single day, and the dataframe has the following columns:
      • DATE: date in which the weather observation was recorded;
      • TAVG_C: average temperature in celsius;
      • DESCRIPTION: weather conditions.
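    As a rough, hedged illustration (not part of the MAT-Builder pipeline itself), the raw trajectory input can be inspected with pandas; the sketch assumes the Paris variant of the input archive has been extracted locally:

    ```python
    # Hedged sketch: peek at the raw preprocessed trajectories described above.
    import pandas as pd

    traj = pd.read_parquet("raw_trajectories_paris_matbuilder.parquet")
    traj["time"] = pd.to_datetime(traj["time"])  # sample timestamps

    # Trajectories per user and samples per trajectory, using the columns listed above.
    print(traj.groupby("user")["traj_id"].nunique().head())
    print(traj.groupby(["user", "traj_id"]).size().describe())
    ```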

    Output data: the semantically enriched Paris and New York City datasets

    Tabular Representation

    The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:

    • traj_cleaned.parquet: parquet file storing the dataframe containing the raw preprocessed trajectories after applying the MAT-Builder's preprocessing step on raw_trajectories_[paris|nyc]_matbuilder.parquet. The dataframe contains the same columns found in raw_trajectories_[paris|nyc]_matbuilder.parquet, except for time which in this dataframe has been renamed to datetime. The operations performed in the MAT-Builder's preprocessing step were:
      • (1) we filtered out trajectories having fewer than 2 samples;
      • (2) we filtered noisy samples inducing velocities above 300 km/h;
      • (3) finally, we compressed the trajectories such that all points within a radius of 20 meters from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point.

    • stops.parquet: parquet file storing the dataframe containing the stop segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific stop segment from some trajectory. The columns are:
      • datetime, which indicates when a stop segment starts;
      • leaving_datetime, which indicates when a stop segment ends;
      • uid, the trajectory user's identifier;
      • tid, the trajectory's identifier;
      • lat, the stop segment's centroid latitude;
      • lng, the stop segment's centroid longitude.
        NOTE: to uniquely identify a stop segment, you need the triple (stop segment's index in the dataframe, uid, tid).
    • moves.parquet: parquet file storing the dataframe containing the samples associated with the move segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific sample belonging to some move segment of some trajectory. The columns are:
      • datetime, the sample's timestamp;
      • uid, the samples' trajectory user's identifier;
      • tid, the sample's trajectory's identifier;
      • lat, the sample's latitude;
      • lng, the sample's longitude;
      • move_id, the identifier of a move segment.
        NOTE: to uniquely identify a move segment, you need the triple (uid, tid, move_id); a short sketch of building these keys follows this list.
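    A minimal sketch, assuming the tabular output archive has been extracted locally, of building the identifying keys described in the two notes above:

    ```python
    # Hedged sketch: construct unique identifiers for stop and move segments.
    import pandas as pd

    stops = pd.read_parquet("stops.parquet")
    moves = pd.read_parquet("moves.parquet")

    # A stop segment is identified by (index in stops.parquet, uid, tid).
    stops = stops.reset_index().rename(columns={"index": "stop_id"})
    print(stops[["stop_id", "uid", "tid", "datetime", "leaving_datetime"]].head())

    # A move segment is identified by (uid, tid, move_id); count samples per segment.
    print(moves.groupby(["uid", "tid", "move_id"]).size().describe())
    ```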

    • enriched_occasional.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed occasional and POIs found to be close to their centroids. As such, in this dataframe an occasional stop can appear multiple times, i.e., when there are multiple POIs located near a stop's centroid. The columns found in this dataframe are the same as in stops.parquet, plus two sets of columns.

      The first set of columns concerns a stop's characteristics:
      • stop_id, which represents the unique identifier of a stop segment and corresponds to the index of said stop in stops.parquet;
      • geometry_stop, which is a Shapely Point representing a stop's centroid;
      • geometry, which is the aforementioned Shapely Point plus a 50-meter buffer around it.

    There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:

      • index_poi, which is the index of the associated POI in the pois.parqet file;
      • osmid, which is the identifier given by OpenStreetMap to the POI;
      • name, the POI's name;
      • wikidata, the POI identifier on wikidata;
      • category, the POI's category;
      • geometry_poi, a Shapely (multi)polygon describing the POI's geometry;
      • distance, the distance between the stop segment's centroid and the POI.

    • enriched_systematic.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed systematic and POIs found to be close to their centroids. This dataframe has exactly the same characteristics as enriched_occasional.parquet, plus the following columns:
      • systematic_id, the identifier of the cluster of systematic stops a systematic stop belongs to;
      • frequency, the number of systematic stops within a systematic stop's cluster;
      • home, the probability that the systematic stop's cluster represents the home of the associated user;
      • work, the probability that the systematic stop's cluster represents the workplace of the associated user;
      • other,

  4. Surface Water - Benthic Macroinvertebrate Results

    • data.cnra.ca.gov
    • data.ca.gov
    csv, pdf, zip
    Updated Jun 3, 2025
    + more versions
    Cite
    California State Water Resources Control Board (2025). Surface Water - Benthic Macroinvertebrate Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-benthic-macroinvertebrate-results
    Explore at:
    pdf, zip, csv
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that single-species bacteria concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample.

    The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

  5. Surface Water - Habitat Results

    • data.cnra.ca.gov
    • data.ca.gov
    csv, pdf, zip
    Updated Jul 3, 2025
    Cite
    California State Water Resources Control Board (2025). Surface Water - Habitat Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-habitat-results
    Explore at:
    pdf, csv, zip
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

  6. DataSeeds.AI-Sample-Dataset-DSD Dataset

    • paperswithcode.com
    Updated Jun 5, 2025
    + more versions
    Cite
    (2025). DataSeeds.AI-Sample-Dataset-DSD Dataset [Dataset]. https://paperswithcode.com/dataset/dataseeds-ai-sample-dataset-dsd
    Explore at:
    Dataset updated
    Jun 5, 2025
    Description

    Dataset Summary The DataSeeds.AI Sample Dataset (DSD) is a high-fidelity, human-curated computer vision-ready dataset comprised of 7,772 peer-ranked, fully annotated photographic images, 350,000+ words of descriptive text, and comprehensive metadata. While the DSD is being released under an open source license, a sister dataset of over 10,000 fully annotated and segmented images is available for immediate commercial licensing, and the broader GuruShots ecosystem contains over 100 million images in its catalog.

    Each image includes multi-tier human annotations and semantic segmentation masks. Generously contributed to the community by the GuruShots photography platform, where users engage in themed competitions, the DSD uniquely captures aesthetic preference signals and high-quality technical metadata (EXIF) across an expansive diversity of photographic styles, camera types, and subject matter. The dataset is optimized for fine-tuning and evaluating multimodal vision-language models, especially in scene description and stylistic comprehension tasks.

    Technical Report: Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery
    GitHub Repo: Access the complete weights and code which were used to evaluate the DSD -- https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
    This dataset is ready for commercial/non-commercial use.

    Dataset Structure
    • Size: 7,772 images (7,010 train, 762 validation)
    • Format: Apache Parquet files for metadata, with images in JPG format
    • Total Size: ~4.1GB
    • Languages: English (annotations)
    • Annotation Quality: All annotations were verified through a multi-tier human-in-the-loop process

  7. PSYCHE-D: predicting change in depression severity using person-generated...

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf
    Updated Jul 18, 2024
    Cite
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. http://doi.org/10.5281/zenodo.5085146
    Explore at:
    pdf, bin
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay
    Description

    This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

    Dataset description

    Parquet file, with:

    • 35694 rows
    • 154 columns

    The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
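    A short sketch, assuming the Parquet file has been downloaded locally (the file name below is hypothetical), of splitting the [participant]_[month] index into its two parts:

    ```python
    # Hedged sketch: recover participant and month from the index described above.
    import pandas as pd

    df = pd.read_parquet("psyche_d_features.parquet")  # hypothetical file name
    print(df.shape)  # 35,694 rows and 154 columns per the description

    parts = df.index.to_series().str.rsplit("_", n=1, expand=True)
    df["participant"] = parts[0].astype(int)  # randomly generated participant ID
    df["month"] = parts[1].astype(int)        # month of data collection
    print(df.groupby("participant")["month"].nunique().describe())
    ```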

    Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

    The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.

    The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

    The data subset used in this work comprises the following:

    • Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
    • Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
    • Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
    • Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity

    From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

    The dataset contains a total of 35,694 rows for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.

  8. TeXtract_augmented_v1

    • huggingface.co
    Updated Jun 12, 2025
    + more versions
    Cite
    ToniDO (2025). TeXtract_augmented_v1 [Dataset]. https://huggingface.co/datasets/ToniDO/TeXtract_augmented_v1
    Explore at:
    Dataset updated
    Jun 12, 2025
    Authors
    ToniDO
    Description

    Mathematical Expressions Dataset

      Dataset Description
    

    This dataset contains images of mathematical expressions along with their corresponding LaTeX code. Images will automatically be displayed as thumbnails in Hugging Face's Data Studio.

      Dataset Summary
    

    • Number of files: 1 Parquet file
    • Estimated number of samples: 12,312
    • Format: Parquet optimized for Hugging Face
    • Features configured for thumbnails: ✅
    • Columns: latex (LaTeX code of the mathematical expression)…

    See the full description on the dataset page: https://huggingface.co/datasets/ToniDO/TeXtract_augmented_v1.
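    A minimal sketch of loading the dataset from the Hugging Face Hub with the datasets library; the split name "train" is an assumption:

    ```python
    # Hedged sketch: load TeXtract_augmented_v1 and read one LaTeX expression.
    from datasets import load_dataset

    ds = load_dataset("ToniDO/TeXtract_augmented_v1", split="train")  # split name assumed
    print(ds)                 # features should include `latex` and an image column
    print(ds[0]["latex"])     # LaTeX code of the first expression
    ```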

  9. CKW Smart Meter Data

    • data.niaid.nih.gov
    Updated Sep 22, 2024
    Cite
    Barahona Garzon, Braulio (2024). CKW Smart Meter Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13304498
    Explore at:
    Dataset updated
    Sep 22, 2024
    Dataset authored and provided by
    Barahona Garzon, Braulio
    Description

    Overview

    The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW publishes anonymised and aggregated data from smart meters that measure electricity consumption in canton Lucerne. This unique dataset is accessible in the ckw.ch/opendata platform.

    Data set A - anonymised smart meter data

    Data set B - aggregated smart meter data

    Contents of this data set

    This data set contains a small sample of the CKW data set A, sorted per smart meter ID and stored as parquet files named after the id field of the corresponding smart meter's anonymised data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet

    The original CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as gzip-compressed CSV files, which are split into one file per calendar month. The columns in the CSV files are:

    id: the anonymized counter ID (text)

    timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)

    value_kwh: the consumption in kWh in the time window under consideration (float)

    In this archive, data from:

    | File size | Export date | Period | File name |
    | ----------- | ------------ | -------- | --------- |
    | 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz |
    | 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz |
    | 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz |
    | 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz |
    | 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz |
    | 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz |
    | 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz |
    | 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz |
    | 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz |
    | 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz |
    | 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz |
    | 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz |
    | 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz |
    | 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz |
    | 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz |
    | 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz |
    | 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz |
    | 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz |
    | 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz |
    | 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz |
    | 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz |
    | 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz |
    | 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz |
    | 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz |
    | 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz |
    | 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz |
    | 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz |
    | 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz |
    | 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz |
    | 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz |
    | 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz |
    | 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz |
    | 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz |
    | 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz |
    | 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz |
    | 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz |
    | 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz |
    | 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |

    was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.

    A small sample of the smart meter data above is archived in the public cloud space of the AISOP project (https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip) and also in this public record. For access to the complete data, contact the authors of this archive.

    It consists of the following parquet files:

    | Size | Date | Name |
    |------|------|------|
    | 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |
    | 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |
    | 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |
    | 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |
    | 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |
    | 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |
    | 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |
    | 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |
    | 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |
    | 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |
    | 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |
    | 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |
    | 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |
    | 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |
    | 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |
    | 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |
    | 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |
    | 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |
    | 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |
    | 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |
    | 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |
    | 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |
    | 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |
    | 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |
    | 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |
    | 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |
    | 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |
    | 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |
    | 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |
    | 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |
    | 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |
    | 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |
    | 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |
    | 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |
    | 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |
    | 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |
    | 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |
    | 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |
    | 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |
    | 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |
    | 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |
    | 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |
    | 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |
    | 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |
    | 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |
    | 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |
    | 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |
    | 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |
    | 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |
    | 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |
    | 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |
    | 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |
    | 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |
    | 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |
    | 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |
    | 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |
    | 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |
    | 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |
    | 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |
    | 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |
    | 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |
    | 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |
    | 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |
    | 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |
    | 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |

    where each file contains data from a single smart meter.
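    A minimal sketch of working with one of the per-meter parquet files above, assuming it keeps the id / timestamp / value_kwh columns of the original CSVs:

    ```python
    # Hedged sketch: aggregate one smart meter's 15-minute readings to daily kWh.
    import pandas as pd

    meter = pd.read_parquet("027ceb7b8fd77a4b11b3b497e9f0b174.parquet")
    meter["timestamp"] = pd.to_datetime(meter["timestamp"], utc=True)

    daily_kwh = meter.set_index("timestamp")["value_kwh"].resample("1D").sum()
    print(daily_kwh.head())
    ```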

    Acknowledgement

    The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from the European Union's Horizon 2020 research and innovation programme under grant agreement No 883973, as part of the ERA-Net Smart Energy Systems joint call on digital transformation for the green energy transition.

  10. Rare Book and Manuscript Library web archive collection derivatives

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Mar 9, 2020
    Cite
    Nick Ruest (2020). Rare Book and Manuscript Library web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3701593
    Explore at:
    application/gzip
    Dataset updated
    Mar 9, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nick Ruest
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web archive derivatives of the Rare Book and Manuscript Library collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-2766-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
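    A minimal sketch of loading one of the Parquet derivatives into pandas, assuming cul-2766-parquet.tar.gz has been extracted locally; the "domains" sub-directory name is an assumption about the extracted layout:

    ```python
    # Hedged sketch: rank domains by capture count using the derivative columns listed below.
    import pandas as pd

    domains = pd.read_parquet("cul-2766-parquet/domains")  # adjust to the actual sub-directory
    print(domains.sort_values("count", ascending=False).head(10))
    ```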

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Word processor files

    The cul-2766-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.
  11. Feature Engineering Data

    • kaggle.com
    Updated Jul 23, 2019
    Cite
    Mat Leonard (2019). Feature Engineering Data [Dataset]. https://www.kaggle.com/matleonard/feature-engineering-data/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 23, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mat Leonard
    Description

    This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.

    For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.

    File descriptions

    train_sample.csv - Sampled data

    Data fields

    Each row of the training data contains a click record, with the following features.

    • ip: ip address of click.
    • app: app id for marketing.
    • device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
    • os: os version id of user mobile phone
    • channel: channel id of mobile ad publisher
    • click_time: timestamp of click (UTC)
    • attributed_time: if the user downloaded the app after clicking an ad, this is the time of the app download
    • is_attributed: the target that is to be predicted, indicating the app was downloaded

    Note that ip, app, device, os, and channel are encoded.

    I'm also including Parquet files with various features for use within the course.
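    A quick sketch, assuming train_sample.csv has been downloaded locally, that loads the sample and checks the roughly 20% positive rate mentioned above:

    ```python
    # Hedged sketch: load the sampled click data and inspect the label balance.
    import pandas as pd

    clicks = pd.read_csv("train_sample.csv", parse_dates=["click_time", "attributed_time"])
    print(clicks["is_attributed"].mean())  # should be roughly 0.2 per the description
    print(clicks.dtypes)
    ```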

  12. Avery Library Historic Preservation and Urban Planning web archive...

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Ruest, Nick; Sala, Christine; Thurman, Alex (2023). Avery Library Historic Preservation and Urban Planning web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/Z68EVJ
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Ruest, Nick; Sala, Christine; Thurman, Alex
    Description

    Web archive derivatives of the Avery Library Historic Preservation and Urban Planning collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-1757-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns: domain, count.

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content.

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns: crawl_date, src, dest, anchor.

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns: src, image_url.

    The cul-1757-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud:

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.

    Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1757-parquet.tar.gz.part* > cul-1757-parquet.tar.gz

  13. Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...

    • data.usgs.gov
    • gimi9.com
    • +1more
    Updated Mar 3, 2025
    + more versions
    Cite
    Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. http://doi.org/10.5066/P13DZ7MP
    Explore at:
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    Mar 1, 2013 - Dec 31, 2023
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations with this dataset include:
    • All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    • Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each dat ...
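    A rough Python counterpart to the example_script_for_using_parquet.R mentioned above (which uses the R arrow package), assuming one of the zip files has been extracted locally; the directory name below is hypothetical:

    ```python
    # Hedged sketch: open the nested .parquet directories as a single pyarrow dataset.
    import pyarrow.dataset as ds

    lakes = ds.dataset("lake_temperature_parquet/", format="parquet")  # hypothetical directory
    print(lakes.schema)                         # inspect available columns before querying
    print(lakes.head(1000).to_pandas().head())  # pull a small sample into memory
    ```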

  14. University Archives web archive collection derivatives

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    + more versions
    Cite
    Ruest, Nick; Wilk, Jocelyn; Thurman, Alex (2023). University Archives web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/FONRZU
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Ruest, Nick; Wilk, Jocelyn; Thurman, Alex
    Description

    Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns: domain, count.

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content.

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns: crawl_date, src, dest, anchor.

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns: src, image_url.

    Binary Analysis

    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files

    The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud:

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.

    Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz

  15. EEL mouse sagittal atlas 168 genes spatial RNA data

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Lars Borm (2023). EEL mouse sagittal atlas 168 genes spatial RNA data [Dataset]. http://doi.org/10.6084/m9.figshare.20324814.v4
    Explore at:
    txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lars Borm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw RNA locations of the mouse atlas produced by EEL FISH for 168 genes.

    RNA files are in the .parquet format which can be opened with FISHscale (https://github.com/linnarsson-lab/FISHscale) or any other parquet file reader (https://arrow.apache.org/docs/index.html)

    RNA .parquet files: Seven sagittal sections of the mouse brain with 168 detected genes, sampled at the medial-lateral positions of -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm and 3600 µm measured from the midline. Position and gene label for all RNA molecules. "c_px_microscope_stitched" contains X coordinates. "r_px_microscope_stitched" contains Y coordinates. The units are pixels with a size of 0.18 micrometer; multiply by 0.18 to get the µm scale. "Valid" is a Boolean column where a 1 means that the molecule was detected inside the tissue section and a 0 means it was detected outside.
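    A minimal sketch, assuming one section's RNA parquet file has been downloaded locally (the file name below is hypothetical), that applies the pixel-to-micrometre conversion and the "Valid" filter described above without requiring FISHscale:

    ```python
    # Hedged sketch: convert stitched pixel coordinates to µm and keep in-tissue molecules.
    import pandas as pd

    rna = pd.read_parquet("mouse_atlas_section_minus140um.parquet")  # hypothetical file name
    rna = rna[rna["Valid"] == 1]                          # detected inside the tissue section
    rna["x_um"] = rna["c_px_microscope_stitched"] * 0.18  # pixel size is 0.18 µm
    rna["y_um"] = rna["r_px_microscope_stitched"] * 0.18
    print(rna[["x_um", "y_um"]].describe())
    ```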

    Tissue polygons .csv files: CSV files demarcating the sample borders for the 7 mouse atlas sections: -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm, 3600 µm. These polygons were used to generate the "Valid" column. If you want to make your own selection, please have a look at the code in: https://github.com/linnarsson-lab/FISHscale/blob/master/FISHscale/utils/inside_polygon.py

    Gene colors .pkl file: Pickled Python dictionary with gene colors used in the paper for the mouse atlas.

  16. Fuτure - dataset for studies, development, and training of algorithms for...

    • zenodo.org
    bin
    Updated Oct 3, 2024
    Cite
    Laurits Tani; Joosep Pata (2024). Fuτure - dataset for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons [Dataset]. http://doi.org/10.5281/zenodo.13881061
    Explore at:
    bin
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laurits Tani; Joosep Pata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data description

    MC Simulation


    The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation being performed by Geant4 with the CLIC-like detector setup CLICdet (CLIC_o3_v14). Events are reconstructed using the Marlin reconstruction framework and interfaced with Key4HEP. Particle candidates in the reconstructed events are reconstructed using the PandoraPF algorithm.

    In this version of the dataset no γγ -> hadrons background is included.

    Samples


    This dataset contains e+e- samples with Z->ττ, ZH (H->ττ) and Z->qq events, with approximately 2 million events simulated in each category.

    The following e+e- processes were simulated with Pythia 8 at sqrt(s) = 380 GeV:

    • p8_ee_qq_ecm380 [Z -> qq events]
    • p8_ee_ZH_Htautau [ZH, H -> tautau events]
    • p8_ee_Z_Ztautau_ecm380 [Z -> tautau events]

    The .root files from the MC simulation chain are eventually processed by the software found in Github in order to create flat ntuples as the final product.


    Features


    The basis of the ntuples is the particle flow (PF) candidates from PandoraPF. Each PF candidate has a four-momentum, charge and particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using the generalized kt algorithm for ee collisions, with parameters p=-1 and R=0.4. The minimum pT is set to 0 GeV for both generator-level jets and reconstructed jets. The dataset contains the four-momenta of the jets, along with the PF candidates in the jets and the properties listed above.

    Additionally, a set of variables describing the tau lifetime are calculated using the software in Github. As tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.

    In summary, the features found in the flat ntuples are:

    | Name | Description |
    |------|-------------|
    | reco_cand_p4s | 4-momenta per particle in the reco jet. |
    | reco_cand_charge | Charge per particle in the jet. |
    | reco_cand_pdg | PDGid per particle in the jet. |
    | reco_jet_p4s | RecoJet 4-momenta. |
    | reco_cand_dz | Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
    | reco_cand_dz_err | Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
    | reco_cand_dxy | Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
    | reco_cand_dxy_err | Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated. |
    | gen_jet_p4s | GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3. |
    | gen_jet_tau_decaymode | Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |
    | gen_jet_tau_p4s | Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used. |

    The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets and are matched to generator-level τ leptons as well as reconstructed jets. In order for a generator-level jet to be matched to generator-level τ lepton, the τ lepton needs to be inside a cone of dR = 0.4. The same applies for the reconstructed jet, with the requirement on dR being set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:

    • a binary flag isTau if it was matched to a generator-level hadronically decaying τ lepton. gen_jet_tau_decaymode of value -1 indicates no match to generator-level hadronically decaying τ.
    • the categorical decay mode of the τ, gen_jet_tau_decaymode, in terms of the number of generator-level charged and neutral hadrons. Possible gen_jet_tau_decaymode values are {0, 1, . . . , 15}.
    • if matched, the visible (neglecting neutrinos), reconstructable pT of the τ lepton. This is inferred from gen_jet_tau_p4s.

    Contents:

    • qq_test.parquet
    • qq_train.parquet
    • zh_test.parquet
    • zh_train.parquet
    • z_test.parquet
    • z_train.parquet
    • data_intro.ipynb

    Dataset characteristics

    | File | # Jets | Size |
    |------|--------|------|
    | z_test.parquet | 870 843 | 171 MB |
    | z_train.parquet | 3 483 369 | 681 MB |
    | zh_test.parquet | 1 068 606 | 213 MB |
    | zh_train.parquet | 4 274 423 | 851 MB |
    | qq_test.parquet | 6 366 715 | 1.4 GB |
    | qq_train.parquet | 25 466 858 | 5.6 GB |

    The dataset consists of 6 files of 8.9 GB in total.

    How can you use these data?

    The .parquet files can be directly loaded with the Awkward Array Python library.
    An example of how one might use the dataset and the features is given in data_intro.ipynb.
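    A minimal sketch, following the note above, that loads one file with Awkward Array and derives the binary isTau flag from gen_jet_tau_decaymode (where -1 means no match):

    ```python
    # Hedged sketch: count reconstructed jets matched to generator-level hadronic tau decays.
    import awkward as ak

    jets = ak.from_parquet("z_test.parquet")
    is_tau = jets["gen_jet_tau_decaymode"] != -1  # -1 indicates no matched hadronic tau
    print(len(jets), "jets,", ak.sum(is_tau), "matched to a generator-level hadronic tau")
    ```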

  17. GitTables 1M

    • explore.openaire.eu
    Updated May 3, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M [Dataset]. http://doi.org/10.5281/zenodo.6517052
    Explore at:
    Dataset updated
    May 3, 2022
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    Description

    Summary

    GitTables 1M (https://gittables.github.io) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories, that are associated with a license that allows distribution. We aim to grow this to at least 10M tables.

    Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.

    We believe GitTables can facilitate many use-cases, among which:
    • Data integration, search and validation.
    • Data visualization and analysis recommendation.
    • Schema analysis and completion for e.g. database or knowledge base design.

    If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).

    Dataset contents

    The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub with. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper: https://arxiv.org/abs/2106.07258v4.

    In summary, this dataset can be characterized as follows:

    | Statistic | Value |
    |-----------|-------|
    | # tables | 1M |
    | average # columns | 12 |
    | average # rows | 142 |
    | # annotated tables (at least 1 column annotation) | 723K+ (DBpedia), 738K+ (Schema.org) |
    | # unique semantic types | 835 (DBpedia), 677 (Schema.org) |

    How to download

    The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download).

    Future releases

    Future releases will include the following:
    • Increased number of tables (expected at least 10M)

    Associated datasets
    • GitTables benchmark - column type detection: https://zenodo.org/record/5706316
    • GitTables 1M - CSV files: https://zenodo.org/record/6515973
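    A minimal sketch of opening a single GitTables parquet file with pyarrow and inspecting the table content plus the annotations stored in the file-level metadata; the file name is hypothetical and the metadata keys depend on the release:

    ```python
    # Hedged sketch: read one table and list a few of its parquet metadata keys.
    import pyarrow.parquet as pq

    table = pq.read_table("example_topic_table.parquet")  # hypothetical file from a topic subset
    print(table.num_rows, table.num_columns)
    print(table.to_pandas().head())

    meta = table.schema.metadata or {}
    for key in list(meta)[:5]:       # annotation keys vary per release
        print(key, meta[key][:120])
    ```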

  18. HERO WEC 2024 Hydraulic Configuration Deployment Data

    • catalog.data.gov
    • mhkdr.openei.org
    • +1more
    Updated Jan 20, 2025
    + more versions
    Cite
    National Renewable Energy Laboratory (2025). HERO WEC 2024 Hydraulic Configuration Deployment Data [Dataset]. https://catalog.data.gov/dataset/hero-wec-2024-hydraulic-configuration-deployment-data-501bc
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    The following submission includes raw and processed data from the in-water deployment of NREL's Hydraulic and Electric Reverse Osmosis Wave Energy Converter (HERO WEC), in the form of parquet files, TDMS files, CSV files, bag files and MATLAB workspaces. This dataset was collected in March 2024 at the Jennette's pier test site in North Carolina.

    This submission includes the following:

    • Data description document (HERO WEC FY24 Hydraulic Deployment Data Descriptions.doc) - This document includes detailed descriptions of the type of data and how it was processed and/or calculated.
    • Processed MATLAB workspace - The processed data is provided in the form of a single MATLAB workspace containing data from the full deployment. This workspace contains data from all sensors down sampled to 10 Hz along with all array Value Added Products (VAPs).
    • MATLAB visualization scripts - The MATLAB workspaces can be visualized using the file "HERO_WEC_2024_Hydraulic_Config_Data_Viewer.m/mlx". The user simply needs to download the processed MATLAB workspaces, specify the desired start and end times and run this file. Both the .m and .mlx file formats have been provided depending on the user's preference.
    • Summary Data - The fully processed data was used to create a summary data set with averages and important calculations performed on 30-minute intervals to align with the intervals of wave resource data reported from nearby CDIP ocean observing buoys located 20 km East of Jennette's pier and 40 km Northeast of Jennette's pier. The wave resource data provided in this data set is to be used for reference only due to the difference in water depth and proximity to shore between the Jennette's pier test site and the locations of the ocean observing buoys. This data is provided in the Summary Data zip folder, which includes this data set in the form of a MATLAB workspace, parquet file, and excel spreadsheet.
    • Processed Parquet File - The processed data is provided in the form of a single parquet file containing data from all HERO WEC sensors collected during the full deployment. Data in these files has been down sampled to 10 Hz and all array VAPs are included.
    • Interim Filtered Data - Raw data from each sensor group partitioned into 30-minute parquet files. These files are outputs from an intermediate stage of data processing and contain the raw data with no Quality Control (QC) or calculations performed, in a format that is easier to use than the raw data.
    • Raw Data - Raw, unprocessed data from this deployment can be found in the Raw Data zip folder. This data is provided in the form of TDMS, CSV, and bag files in the original format output by the MODAQ system.
    • Python Data Processing Script - This links to an NREL public GitHub repository containing the python script used to go from raw data to fully processed parquet files. Additional documentation on how to use this script is included in the GitHub repository.

    This data set has been developed by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Water Power Technologies Office.
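    A minimal sketch, assuming the processed parquet file has been downloaded; the file name and the timestamp column name below are hypothetical (see the data description document for the actual channel names). It re-computes 30-minute averages in the spirit of the summary data set:

    ```python
    # Hedged sketch: resample the 10 Hz processed data to 30-minute averages.
    import pandas as pd

    df = pd.read_parquet("hero_wec_2024_hydraulic_processed.parquet")  # hypothetical file name
    df = df.set_index(pd.to_datetime(df["time"]))  # "time" column name is an assumption
    summary_30min = df.resample("30min").mean(numeric_only=True)
    print(summary_30min.head())
    ```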

  19. Freely Accessible eJournals web archive collection derivatives

    • zenodo.org
    • explore.openaire.eu
    • +1more
    application/gzip
    Updated Feb 2, 2020
    Cite
    Nick Ruest; Nick Ruest (2020). Freely Accessible eJournals web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3633671
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Feb 2, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Nick Ruest; Nick Ruest
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web archive derivatives of the Freely Accessible eJournals collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-5921-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
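    A minimal sketch of that conversion is shown below; the extraction path and directory layout are assumptions and may differ from the actual contents of the tarball.

    import pandas as pd

    # Assumed layout after extracting cul-5921-parquet.tar.gz; adjust the path
    # to the actual directory containing the web pages Parquet files.
    webpages = pd.read_parquet("cul-5921-parquet/webpages/")
    print(webpages[["crawl_date", "url", "mime_type_web_server"]].head())

    Reading a directory of Parquet part files as one table requires a Parquet engine such as pyarrow (pip install pyarrow).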

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("domain")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor
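
    Because the webgraph derivative is an edge list, it can be turned into a network for analysis. The sketch below builds a directed graph with networkx; the path is an assumption, and the column names follow the list above.

    import pandas as pd
    import networkx as nx

    # Assumed path to the extracted webgraph derivative.
    edges = pd.read_parquet("cul-5921-parquet/webgraph/")

    # One directed edge per (src, dest) link recorded in the crawl.
    graph = nx.from_pandas_edgelist(edges, source="src", target="dest",
                                    create_using=nx.DiGraph())
    print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")

    The resulting graph can be written out with nx.write_gexf() and opened in Gephi, much like the Gephi file included in the auk derivatives described below.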

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url
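
    A simple use of this derivative is to rank the most frequently linked images; the path is an assumption, as above.

    import pandas as pd

    # Assumed path to the extracted image links derivative.
    links = pd.read_parquet("cul-5921-parquet/imagelinks/")
    print(links["image_url"].value_counts().head(10))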

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files

    The cul-12143-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with the crawl date, the domain name, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.
  20. Geologic Field Trip Guidebooks Web Archive collection derivatives

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Feb 13, 2020
    Cite
    Nick Ruest; Nick Ruest; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams (2020). Geologic Field Trip Guidebooks Web Archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3666295
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Feb 13, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Nick Ruest; Nick Ruest; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web archive derivatives of the Geologic Field Trip Guidebooks Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The ivy-12576-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
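    Before loading a derivative into pandas, it can be useful to inspect its schema with pyarrow. The sketch below assumes the tarball extracts to a directory of Parquet part files; adjust the path to the actual layout.

    import pyarrow.parquet as pq

    # Assumed path after extracting ivy-12576-parquet.tar.gz.
    dataset = pq.ParquetDataset("ivy-12576-parquet/webpages/")
    print(dataset.schema)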

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("domain")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files

    The ivy-12576-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with the crawl date, the domain name, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.