This dataset contains the AtollGen pipeline inputs, including:
- sources.tar.xz: all data used for the AtollGen pipeline
  - abioteau (FASTA files and a CSV file)
  - iceberg (v1.0: HTML files)
  - iceberg_v2 (v2.0: FASTA file)
  - islander (SQL dump and SQLite file)
  - iv4 (CSV files)
  - jlao (FirmiData: FASTA files and XLSX files)
- int.hmm / mob.hmm: HMM files for integration and other mobility modules
- hmm_signature_categs.json: mapping file between the signatures recorded in the mobility-module HMM files and the mobility-module categorisation
- Pfam-A.hmm.gz: Pfam-A v34.0 frozen version
- card.json: CARD antibiotic resistance collection file

Defense Finder models (v0.0.3) can be fetched via the macsyfinder download utility (macsydata) on GitHub: https://github.com/gem-pasteur/macsyfinder
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.
Feature Engineering is a crucial step in Machine Learning.
In this project, I show:
- Handling missing values with SimpleImputer
- Encoding categorical variables with OneHotEncoder
- Building models manually vs using Pipeline
- Saving models and pipelines with pickle
- Making predictions with and without pipelines (a minimal build-and-save sketch follows this list)
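As a minimal sketch of how these pieces fit together (assuming the standard Titanic columns Age, SibSp, Parch, Fare, Sex and Embarked, a local train.csv, and a LogisticRegression model; the notebook's actual choices may differ):

```python
import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("train.csv")  # assumed local copy of the Titanic training data
X = df[["Age", "SibSp", "Parch", "Fare", "Sex", "Embarked"]].values
y = df["Survived"].values

preprocess = ColumnTransformer([
    # numeric columns 0-3: fill missing values with the column mean
    ("num", SimpleImputer(strategy="mean"), [0, 1, 2, 3]),
    # categorical columns 4-5: fill missing values, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), [4, 5]),
])

pipe = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

with open("pipe.pkl", "wb") as f:
    pickle.dump(pipe, f)  # the saved pipeline can then be reloaded for predictions
```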
The models/ folder contains:
- pipe.pkl → complete ML pipeline (recommended for predictions)
- clf.pkl → classifier without a pipeline
- ohe_sex.pkl, ohe_embarked.pkl → encoders for the categorical features

Predict with pipeline:

import pickle
pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))
# One passenger in the order [Age, SibSp, Parch, Fare, Sex, Embarked]
sample = [[22, 1, 0, 7.25, 'male', 'S']]
print(pipe.predict(sample))
Predict without pipeline:

import pickle
clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))
# Preprocess the input manually with the encoders, then predict with clf (see the sketch below).
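Continuing from the loads above, the manual step might look roughly like the sketch below. The feature order passed to clf and the encoders' output format (dense vs. sparse) are assumptions, not taken from the notebook, so verify them before relying on this:

```python
import numpy as np
from scipy import sparse

def dense(a):
    # The saved encoders may return sparse matrices depending on how they were created.
    return a.toarray() if sparse.issparse(a) else np.asarray(a)

# Raw input, assumed order: Age, SibSp, Parch, Fare, Sex, Embarked
age, sibsp, parch, fare, sex, embarked = 22, 1, 0, 7.25, "male", "S"

sex_enc = dense(ohe_sex.transform([[sex]]))                 # one-hot vector for Sex
embarked_enc = dense(ohe_embarked.transform([[embarked]]))  # one-hot vector for Embarked

# The assembled column order must match what clf was trained on (assumed here).
features = np.hstack([[[age, sibsp, parch, fare]], sex_enc, embarked_enc])
print(clf.predict(features))
```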
🎯 Inspiration
Learn the difference between manual feature engineering and pipeline-based workflows
Understand how to avoid data leakage using Pipeline
Explore cross-validation with pipelines (a short sketch follows this list)
Practice model persistence and deployment strategies
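For the cross-validation point, a short sketch (reusing pipe, X and y from the build sketch above): because the imputer and encoder live inside the Pipeline, they are re-fit on each training fold, so no statistics leak from the held-out fold.

```python
from sklearn.model_selection import cross_val_score

# Preprocessing is re-fit inside every fold, which is what prevents leakage.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```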
✅ Best Practice: Use pipe.pkl (pipeline) for predictions — it automatically handles preprocessing + modeling in one step!
https://www2.gov.bc.ca/gov/content?id=A519A56BC2BF44E4A008B33FCF527F61
Use this GeoJSON file as an input dataset in Data Pipelines. To get started, follow the steps in the Create your first data pipeline tutorial. To learn more about Data Pipelines, see Introduction to Data Pipelines.
CLARA
This deposit is part of the CLARA project. The CLARA project aims to empower teachers in the task of creating new educational resources, and in particular in handling the licenses of reused educational resources.
The present deposit contains the JSON files extracted from the X5GON PostgreSQL database. The files are fed to the pipeline of the CLARA project for the creation of four different RDF graphs. This is achieved through the use of RDF mappings (RML, RML-star). That pipeline can be found on GitLab.
The results of this pipeline can also be found on Zenodo, on those four different deposits:
Standard reification
Singleton properties
Named graphs
RDF-star
Content
The JSON files contain information on a total of 45K educational resources, linked to a total of 135K subjects (extracted from DBpedia). Each educational resource is linked to the subjects it talks about. Each of those links has two corresponding scores which represent the certainty of the given link. Those scores are "norm_cosine" and "norm_pageRank".
The dataset was cut into multiple JSON files in order to make its processing easier. There are two types of JSON files in this deposit:
authors_[X].json - which lists the authors' names
ER_[X].json - which lists the educational resources and their related information (a loading sketch follows the list of fields below). That information contains:
their title.
their description.
their language (and language_detected; only the former is used in this pipeline).
their license.
their mimetype.
the authors.
the date of creation of the resource.
a url linking to the resource itself.
and finally the subjects (named concepts) associated with the resource, along with the corresponding scores.
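As an illustration only, reading one of the ER files could look like the sketch below; the file name and the JSON key names are assumptions, not taken from the deposit, so check the actual files first:

```python
import json

# "ER_0.json" and the key names below are hypothetical placeholders.
with open("ER_0.json", encoding="utf-8") as f:
    resources = json.load(f)

for er in resources:
    for concept in er.get("concepts", []):
        # Each resource-subject link carries two certainty scores.
        print(er.get("title"), concept.get("label"),
              concept.get("norm_cosine"), concept.get("norm_pageRank"))
```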
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
China PMI: Pipeline & Other Transport & Storage: Input Price data was reported at 63.400 % in Dec 2009. This records a decrease from the previous number of 68.300 % for Nov 2009. China PMI: Pipeline & Other Transport & Storage: Input Price data is updated monthly, averaging 60.030 % from Jan 2008 (Median) to Dec 2009, with 24 observations. The data reached an all-time high of 75.640 % in Jun 2008 and a record low of 41.470 % in Mar 2009. China PMI: Pipeline & Other Transport & Storage: Input Price data remains active status in CEIC and is reported by National Bureau of Statistics. The data is categorized under China Premium Database’s Business and Economic Survey – Table CN.OP: Purchasing Managers' Index: Non Manufacturing: Pipeline & Other Transport & Storage.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a demo dataset to use as input images for the CellProfiler pipeline of CircaSCOPE. Images were acquired with an IncuCyte Zoom microscope (Essen BioScience).
The images are arranged in directories as follows (a path-parsing sketch follows the legend below): YYMM/HH/Vessel/Well-Site-Channel.tif
YY - year
MM - month
HH - hour
Vessel - the vessel number; in this demo, 479 contains the untreated control and 480 contains 100 nM dexamethasone-treated cells
Well - coordinates in the 24-well plate
Site - the field-of-view number inside each well, between 1 and 16
Channel - C1 = green, C2 = red, P = phase
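A small parsing sketch for this layout is shown below; the well coordinate format and the exact separators in the file name are assumptions based on the pattern above:

```python
import re
from pathlib import Path

# Assumed file layout: YYMM/HH/Vessel/Well-Site-Channel.tif
FILENAME = re.compile(r"(?P<well>[^-]+)-(?P<site>\d+)-(?P<channel>C1|C2|P)\.tif$")

def parse_image_path(path: str) -> dict:
    p = Path(path)
    yymm, hour, vessel = p.parts[-4], p.parts[-3], p.parts[-2]
    m = FILENAME.match(p.name)
    return {
        "year": yymm[:2], "month": yymm[2:],  # from the YYMM folder
        "hour": hour,
        "vessel": vessel,
        **(m.groupdict() if m else {}),
    }

print(parse_image_path("2101/03/479/B2-5-C1.tif"))
```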
Unique ID of the registered user
How many days the user was active on the platform in the last 7 days.
Number of products viewed by the user in the last 15 days.
Vintage (in days) of the user as of today.
Most frequently viewed (page loads) product by the user in the last 15 days. If multiple products have a similar number of page loads, consider the most recent one (a pandas sketch of this rule follows the list). If the user has not viewed any product in the last 15 days, set it to Product101.
Most frequently used OS by the user.
Most recently viewed (page loads) product by the user. If the user has not viewed any product, set it to Product101.
Count of page loads in the last 7 days by the user.
Count of clicks in the last 7 days by the user.
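The "most frequently viewed product" rule above can be made concrete with a small pandas sketch; the raw page-view log and its column names are hypothetical, since only the derived features are described here:

```python
import pandas as pd

# Hypothetical page-view log, one row per product page load.
views = pd.DataFrame({
    "user_id":   [1, 1, 1, 2],
    "product":   ["Product7", "Product3", "Product7", "Product9"],
    "timestamp": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-06", "2024-01-04"]),
})

def most_viewed_product(user_views: pd.DataFrame) -> str:
    # Count page loads per product; break ties by the most recent view.
    stats = (user_views.groupby("product")["timestamp"]
             .agg(loads="count", last_view="max"))
    return stats.sort_values(["loads", "last_view"], ascending=False).index[0]

most_viewed = views.groupby("user_id")[["product", "timestamp"]].apply(most_viewed_product)
# Users with no views in the window would be assigned "Product101".
print(most_viewed)
```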
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Execution of the preparation pipeline as a single loop over the input file.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lithuania Construction Input Price Index: Waste Water Pipelines data was reported at 123.289 2010=100 in Dec 2017. This records an increase from the previous number of 122.584 2010=100 for Nov 2017. Lithuania Construction Input Price Index: Waste Water Pipelines data is updated monthly, averaging 103.161 2010=100 from Jan 2000 (Median) to Dec 2017, with 216 observations. The data reached an all-time high of 123.289 2010=100 in Dec 2017 and a record low of 70.468 2010=100 in Jan 2002. Lithuania Construction Input Price Index: Waste Water Pipelines data remains active status in CEIC and is reported by Statistics Lithuania. The data is categorized under Global Database’s Lithuania – Table LT.I016: Construction Input Price Index: 2010=100. Rebased from 2010=100 to 2015=100. Replacement series ID: 400954327.
https://www.ibisworld.com/about/termsofuse/
Technological advances in directional drilling and hydraulic fracturing have boosted US oil and gas output to record highs, significantly strengthening the country’s role as a primary energy supplier and exporter. This production boom has supported a steady increase in natural gas liquid production and met global supply needs amid international disruptions, such as the sanctions on Russia’s energy exports. Industrial expansion and a surge in construction activity have also driven up demand for diesel and gasoline, while electric power generator sales have remained strong. In this environment, the industry generated $15.8 billion in revenue for 2025, growing by 1.0% over the year. Despite the moderation in headline growth, profit rose 7.9% in 2025 as operators benefited from high utilization and stable, fee-based contracts. The US refined petroleum pipeline industry has also experienced stable but slowing revenue growth over the last five years, with a current five-year revenue CAGR of 2.3%. Several key trends are shaping industry performance in 2025. Domestic energy production remains robust, supported by volatile but generally elevated energy prices and ongoing industrial demand, particularly in plastics, manufacturing and power generation sectors. Near-term demand has remained resilient even as electric vehicle adoption accelerates and policy shifts gradually favor renewable energy. At the same time, pipeline operators are facing cost headwinds from lingering tariff pressures on imported steel and aluminum, materials critical for new pipeline construction and maintenance. Tariffs have pushed up input costs, prompting companies to focus on efficiency gains and technology investments, such as Smart Grid networks, to optimize operations and safeguard margins. Market consolidation continues as larger operators seek scale in a shifting regulatory landscape, while ongoing geopolitical risks and energy price volatility reinforce the sector’s focus on reliability and logistics innovation. The broader economic environment, including expectations of lower interest rates from the Federal Reserve, will likely sustain liquidity and support capital access for critical infrastructure upgrades. Looking forward, the outlook for the refined petroleum pipeline industry will be defined by a slower growth trajectory and a gradually evolving energy mix. Persistent demand for petroleum-based products in key sectors will be balanced against regulatory uncertainty, evolving energy transition policies and more modest expansion in new pipeline capacity as energy prices ease. Advances in automation and digital pipeline management should partially offset the impact of slower volume growth and rising compliance costs. Over the next five years, industry revenue is expected to increase at a CAGR of 1.2%, reaching $16.8 billion by 2030, with profit growth strengthening from 7.9% in 2025 to an estimated 8.5% by 2030, as operators adapt to the evolving market landscape.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Natural Gas Imports: Pipeline: From Canada: To Warroad, Minnesota data was reported at 237.000 Cub ft mn in Sep 2025. This records an increase from the previous number of 180.000 Cub ft mn for Aug 2025. Natural Gas Imports: Pipeline: From Canada: To Warroad, Minnesota data is updated monthly, averaging 294.000 Cub ft mn from Jan 2011 (Median) to Sep 2025, with 177 observations. The data reached an all-time high of 599.000 Cub ft mn in Jan 2013 and a record low of 147.000 Cub ft mn in Sep 2022. Natural Gas Imports: Pipeline: From Canada: To Warroad, Minnesota data remains active status in CEIC and is reported by U.S. Energy Information Administration. The data is categorized under Global Database’s United States – Table US.RB: Natural Gas Imports: Pipeline: by Point of Entry.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global data pipeline drift detection market size reached USD 1.42 billion in 2024, driven by the increasing complexity of data ecosystems and the need for robust monitoring solutions. The market is expected to grow at a CAGR of 19.6% during the forecast period, reaching USD 6.09 billion by 2033. This rapid growth is attributed to the surge in adoption of artificial intelligence (AI), machine learning (ML), and big data analytics across various industries, which has heightened the demand for real-time data integrity and quality assurance mechanisms.
One of the primary growth factors for the data pipeline drift detection market is the exponential increase in data volumes and the corresponding need to ensure data quality and reliability. As organizations increasingly rely on automated data pipelines to support business intelligence, decision-making, and customer experiences, the risk of data drift—where input data distributions shift from those seen during model training—has become a critical concern. This has led to substantial investments in drift detection technologies that can proactively identify and mitigate anomalies, ensuring that data-driven operations remain accurate and trustworthy. The proliferation of cloud-native architectures and hybrid data environments further amplifies the need for advanced drift detection solutions that can operate seamlessly across diverse infrastructures.
Another significant driver is the regulatory landscape, which is evolving rapidly in response to data privacy, compliance, and governance requirements. Organizations in highly regulated sectors such as BFSI, healthcare, and retail are under increasing pressure to maintain data integrity and demonstrate compliance with standards such as GDPR, HIPAA, and PCI DSS. Data pipeline drift detection tools provide automated monitoring and alerting capabilities that help these organizations detect deviations, maintain audit trails, and ensure continuous compliance. The integration of drift detection with broader data governance frameworks is becoming a best practice, further fueling market growth as enterprises seek to minimize risk and avoid costly data breaches or regulatory penalties.
Technological advancements are also propelling the market forward. The adoption of AI and ML-powered drift detection algorithms enables organizations to detect subtle and complex data drifts that traditional rule-based systems might miss. These intelligent solutions leverage statistical analysis, pattern recognition, and predictive analytics to provide real-time insights into data pipeline health. Furthermore, the rise of DevOps and DataOps practices is driving the need for automated, scalable, and easily deployable drift detection solutions that can integrate with existing data management workflows. The increasing availability of open-source drift detection frameworks is lowering barriers to entry, enabling even small and medium-sized enterprises to benefit from advanced monitoring capabilities.
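As a toy illustration of the statistical techniques mentioned above (not tied to any particular vendor tool), a two-sample Kolmogorov-Smirnov test can flag when a numeric feature's live distribution has drifted away from its training baseline:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # baseline seen at training time
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)   # shifted production data

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected: KS statistic={stat:.3f}, p={p_value:.2e}")
else:
    print("No significant drift detected")
```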
From a regional perspective, North America continues to dominate the data pipeline drift detection market, accounting for the largest share in 2024. This leadership is supported by the region's mature IT infrastructure, high adoption of cloud technologies, and the presence of leading technology vendors. However, Asia Pacific is emerging as the fastest-growing region, with a projected CAGR of over 22% through 2033. The rapid digital transformation across sectors in countries like China, India, and Japan, combined with increasing investments in data-driven initiatives, is accelerating demand for drift detection solutions. Europe also represents a significant market, driven by stringent data privacy regulations and a strong focus on data governance across industries.
The component segment of the data pipeline drift detection market is bifurcated into software and services, each playing a pivotal role in the adoption and implementation of drift detection solutions. Software solutions are at the core of this market, encompassing a wide array of tools and platforms designed to automate the detection of data drifts, monitor model performance, and generate actionable alerts. These solutions leverage advanced analytics, AI, and machine learning algorithms to provide real-time insights into data pipeline health. The software segment is wi
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Natural Gas Imports: Pipeline: From Canada: To Highgate Springs, Vermont data was reported at 12,528.000 Cub ft mn in 2024. This records an increase from the previous number of 12,494.000 Cub ft mn for 2023. Natural Gas Imports: Pipeline: From Canada: To Highgate Springs, Vermont data is updated yearly, averaging 9,319.000 Cub ft mn from Dec 1996 (Median) to 2024, with 29 observations. The data reached an all-time high of 14,574.000 Cub ft mn in 2016 and a record low of 7,680.000 Cub ft mn in 1998. Natural Gas Imports: Pipeline: From Canada: To Highgate Springs, Vermont data remains active status in CEIC and is reported by U.S. Energy Information Administration. The data is categorized under Global Database’s United States – Table US.RB030: Natural Gas Imports: Pipeline: by Point of Entry: Annual.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corpora used to calculate commonly occurring workflow fragments from the LONI Pipeline
Taken from the README of the google-research/big_transfer repo:
by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby
In this repository we release multiple models from the Big Transfer (BiT): General Visual Representation Learning paper that were pre-trained on the ILSVRC-2012 and ImageNet-21k datasets. We provide the code to fine-tune the released models in the major deep learning frameworks: TensorFlow 2, PyTorch and Jax/Flax.
We hope that the computer vision community will benefit by employing more powerful ImageNet-21k pretrained models as opposed to conventional models pre-trained on the ILSVRC-2012 dataset.
We also provide colabs for a more exploratory interactive use: a TensorFlow 2 colab, a PyTorch colab, and a Jax colab.
Make sure you have Python>=3.6 installed on your machine.
To set up TensorFlow 2, PyTorch or Jax, follow the instructions provided in the corresponding repository linked here.
In addition, install the Python dependencies by running (please select tf2, pytorch or jax in the command below):
pip install -r bit_{tf2|pytorch|jax}/requirements.txt
First, download the BiT model. We provide models pre-trained on ILSVRC-2012 (BiT-S) or ImageNet-21k (BiT-M) for 5 different architectures: ResNet-50x1, ResNet-101x1, ResNet-50x3, ResNet-101x3, and ResNet-152x4.
For example, if you would like to download the ResNet-50x1 pre-trained on ImageNet-21k, run the following command:
wget https://storage.googleapis.com/bit_models/BiT-M-R50x1.{npz|h5}
Other models can be downloaded accordingly by plugging the name of the model (BiT-S or BiT-M) and architecture in the above command.
Note that we provide models in two formats: npz (for PyTorch and Jax) and h5 (for TF2). By default we expect that model weights are stored in the root folder of this repository.
Then, you can run fine-tuning of the downloaded model on your dataset of interest in any of the three frameworks. All frameworks share the command line interface
python3 -m bit_{pytorch|jax|tf2}.train --name cifar10_`date +%F_%H%M%S` --model BiT-M-R50x1 --logdir /tmp/bit_logs --dataset cifar10
Currently, all frameworks will automatically download the CIFAR-10 and CIFAR-100 datasets. Other public or custom datasets can be easily integrated: in TF2 and JAX we rely on the extensible TensorFlow Datasets library. In PyTorch, we use torchvision’s data input pipeline.
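For the PyTorch path, a generic torchvision input pipeline for a custom image folder might look like the sketch below; this is illustrative only and is not the repository's own loader or its integration point for custom datasets:

```python
import torch
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

# Assumes an ImageFolder-style layout: my_dataset/<class_name>/<image>.jpg
dataset = datasets.ImageFolder("my_dataset", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

images, labels = next(iter(loader))
print(images.shape, labels.shape)
```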
Note that our code uses all available GPUs for fine-tuning.
We also support training in the low-data regime via the `--examples_per_class` option.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
⚠️⚠️⚠️ NSFW Content Warning ⚠️⚠️⚠️ This dataset/model contains content that may be offensive or inappropriate for some users, including NSFW (Not Safe For Work) material. Please proceed with caution.
DataFlow demo -- Text Pipeline
This dataset card serves as a demo for showcasing the Text data processing pipeline of the Dataflow Project. It provides an intuitive view of the pipeline’s input dirty data and filtered outputs.
Overview
The purpose of the Text Pipeline is to… See the full description on the dataset page: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-Text.
This dataset contains Natural Gas Imports by Entry Point - International pipelines. Follow datasource.kapsarc.org for timely data to advance energy economics research. Notes: CORES uses GWh as the unit of measure for natural gas.
In terms of transportation costs, the most expensive import source for natural gas in Italy was Norway: on average, a metric ton of Norwegian gas imported via the Griess mountain pass cost ** euros in 2018. By contrast, Russian gas had the lowest transport costs, as it cost about **** euros to import a metric ton of gas via the Tarvisio-Malborghetto pipeline.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains example input data, including raw images, codebooks, parameters, and segmentation labels needed to run the FISH spatial transcriptomics pipeline tool PIPEFISH. The datasets contained are:
in situ sequencing (ISS) of a whole coronal slice of a mouse brain (50 genes). Link to publication.
Gataric, M., Park, J.S., Li, T., Vaskivskyi, V., Svedlund, J., Strell, C., Roberts, K., Nilsson, M., Yates, L.R., Bayraktar, O. and Gerstung, M., 2021. PoSTcode: Probabilistic image-based spatial transcriptomics decoder. bioRxiv, pp.2021-10.
MERFISH of human U2-OS cell cultures (130 genes). Link to publication.
Moffitt, J.R., Hao, J., Wang, G., Chen, K.H., Babcock, H.P. and Zhuang, X., 2016. High-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization. Proceedings of the National Academy of Sciences, 113(39), pp.11046-11051.
seqFISH of a developing mouse embryo (351 genes). Link to publication.
Lohoff, T., Ghazanfar, S., Missarova, A., Koulena, N., Pierson, N., Griffiths, J.A., Bardot, E.S., Eng, C.H., Tyser, R.C.V., Argelaguet, R. and Guibentif, C., 2022. Integration of spatial and single-cell transcriptomic data elucidates mouse organogenesis. Nature biotechnology, 40(1), pp.74-85.
To format the inputs correctly, run the prep_input.py script for the dataset you wish to process, from the same directory as the script.
Memory requirements for each dataset:
iss_mouse_brain - 3GB
merfish_human_u2os - 7GB
seqfish_mouse_embryo - 37GB
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recombinant Read Extraction Pipeline with Test Input Data

Description: This dataset showcases the Recombinant Read Extraction Pipeline, previously developed by us (https://doi.org/10.6084/m9.figshare.26582380), designed for the detection of recombination events in sequencing data. The pipeline enables the alignment of sequence reads to a reference genome, generation of SNP strings, identification of haplotypes, extraction of recombinant sequences, and comprehensive result compilation into an Excel summary for seamless analysis.

Included in this dataset:
- config.json: Configuration file with default settings.
- pipeline_test_reads.fa: A test FASTA file containing simulated recombination and allele replacement events, specifically:
  - Two recombination events, each covered by 15 reads, transitioning between Solanum lycopersicum cv. Moneyberg and Moneymaker haplotypes.
  - One recombination event covered by 20 reads, involving a switch at the extremity of the analysed amplicon from the Moneymaker to the Moneyberg haplotype.
  - One allele replacement event covered by 20 reads, featuring recombination from Moneymaker to Moneyberg and back to Moneymaker.
  - Wild-type Solanum lycopersicum cv. Moneyberg and Moneymaker sequences.
- final_output.xlsx: Example output summarizing read names, sequences, and read counts.

Usage instructions:
- Install dependencies: Follow the installation guidelines to set up the required software and Python libraries (please refer to https://doi.org/10.6084/m9.figshare.26582380).
- Configure pipeline: Customize parameters in config.json as needed.
- Run pipeline: Execute the pipeline using the provided script to process the test input file.
- Review outputs: Examine final_output.xlsx to verify the detection and summarization of recombinant events.

The dataset pipeline_test_reads.fa serves as a control dataset designed to verify the functionality of the Recombinant Read Extraction Pipeline previously described (https://doi.org/10.6084/m9.figshare.26582380). This dataset contains artificially generated "reads" and does not include any genuine DNA sequencing data.

Keywords: Genomic Data Processing, Recombinant Detection, Haplotype Analysis, Bioinformatics Pipeline, SNP Analysis