Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Wikipedia Category Granularity (WikiGrain)" data consists of three files that contain information about articles of the English-language version of Wikipedia (https://en.wikipedia.org).
The data has been generated from the database dump dated 20 October 2016, provided by the Wikimedia Foundation and licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License.
WikiGrain provides information on all 5,006,601 Wikipedia articles (that is, pages in Namespace 0 that are not redirects) that are assigned to at least one category.
The WikiGrain Data is analyzed in the paper
Jürgen Lerner and Alessandro Lomi: Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE, 13(1):e0190674, 2018.
===============================================================
Individual files (tables in comma-separated-values-format):
---------------------------------------------------------------
* article_info.csv contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "granularity"
(decimal) The granularity of an article A is defined to be the average (mean) granularity of the categories of A, where the granularity of a category C is the shortest-path distance in the parent-child subcategory network from the root category (Category:Articles) to C. Higher granularity values indicate articles whose topics are narrower and more specific; see the sketch after this variable list.
- "is.FA"
(boolean) True ('1') if the article is a featured article; false ('0') otherwise.
- "is.FA.or.GA"
(boolean) True ('1') if the article is a featured article or a good article; false ('0') otherwise.
- "is.top.importance"
(boolean) True ('1') if the article is listed as a top importance article by at least one WikiProject; false ('0') otherwise.
- "number.of.revisions"
(integer) Number of times a new version of the article has been uploaded.
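To make the granularity definition above concrete, here is a minimal sketch (not the authors' code) that computes it with networkx, assuming a hypothetical list of (parent, child) category edges and the set of an article's categories:

import networkx as nx

def article_granularity(category_edges, article_categories, root="Category:Articles"):
    g = nx.DiGraph(category_edges)  # edges point parent -> child
    # depth[c] = shortest-path distance from the root category to category c
    depth = nx.single_source_shortest_path_length(g, root)
    depths = [depth[c] for c in article_categories if c in depth]
    # article granularity = mean granularity (depth) of its categories
    return sum(depths) / len(depths) if depths else None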
---------------------------------------------------------------
* article_to_tlc.csv
is a list of links from articles to the closest top-level categories (TLC) they are contained in. We say that an article A is a member of a TLC C if A is in a category that is a descendant of C and the distance from C to A (measured by the number of parent-child category links) is minimal over all TLCs. An article can thus be a member of several TLCs; a sketch of this rule follows the variable list below.
The file contains the following variables:
- "id"
(integer) Unique identifier for articles; identical with the page_id in the Wikipedia database.
- "id.of.tlc"
(integer) Unique identifier for TLC in which the article is contained; identical with the page_id in the Wikipedia database.
- "title.of.tlc"
(string) Title of the TLC in which the article is contained.
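As a hedged illustration of the membership rule above, assuming a hypothetical parent->child category digraph g whose nodes include articles as leaves, and a set tlcs of top-level categories:

import networkx as nx

def closest_tlcs(g, tlcs, article):
    # distances from the article upward to all of its ancestor categories
    dist = nx.single_source_shortest_path_length(g.reverse(copy=False), article)
    reachable = {t: dist[t] for t in tlcs if t in dist}
    if not reachable:
        return set()
    d_min = min(reachable.values())
    # the article is a member of every TLC at the minimal distance
    return {t for t, d in reachable.items() if d == d_min}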
---------------------------------------------------------------
* article_info_normalized.csv
contains more variables associated with articles than article_info.csv. All variables except "id" and "is.FA" are normalized to a standard deviation of one. Variables whose names have the prefix "log1p." have been transformed by the mapping x --> log(1+x) to make right-skewed distributions 'more normal'. (A sketch of this normalization follows the variable list below.)
The file contains the following variables:
- "id"
Article id.
- "is.FA"
Boolean indicator for whether the article is featured.
- "log1p.length"
Length measured by the number of bytes.
- "age"
Age measured by the time since the first edit.
- "log1p.number.of.edits"
Number of times a new version of the article has been uploaded.
- "log1p.number.of.reverts"
Number of times a revision has been reverted to a previous one.
- "log1p.number.of.contributors"
Number of unique contributors to the article.
- "number.of.characters.per.word"
Average number of characters per word (one component of 'reading complexity').
- "number.of.words.per.sentence"
Average number of words per sentence (second component of 'reading complexity').
- "number.of.level.1.sections"
Number of first level sections in the article.
- "number.of.level.2.sections"
Number of second level sections in the article.
- "number.of.categories"
Number of categories the article is in.
- "log1p.average.size.of.categories"
Average size of the categories the article is in.
- "log1p.number.of.intra.wiki.links"
Number of links to pages in the English-language version of Wikipedia.
- "log1p.number.of.external.references"
Number of external references given in the article.
- "log1p.number.of.images"
Number of images in the article.
- "log1p.number.of.templates"
Number of templates that the article uses.
- "log1p.number.of.inter.language.links"
Number of links to articles in other language editions of Wikipedia.
- "granularity"
As in article_info.csv (but normalized to standard deviation one).
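The transformations described for article_info_normalized.csv can be reproduced roughly as follows. This is a sketch under the assumption of a hypothetical raw table with the same column layout, not the authors' processing script:

import numpy as np
import pandas as pd

raw = pd.read_csv("article_info_raw.csv")  # hypothetical untransformed table
log_cols = [c for c in raw.columns if c.startswith("log1p.")]
raw[log_cols] = np.log1p(raw[log_cols])  # x --> log(1+x) for right-skewed counts
scale_cols = raw.columns.difference(["id", "is.FA"])
raw[scale_cols] = raw[scale_cols] / raw[scale_cols].std()  # unit standard deviation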
===============================================================
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Time-Series Database for Network Telemetry market size in 2024 reached USD 1.23 billion, reflecting the rapid adoption of advanced database solutions for real-time network management. The market is experiencing robust expansion, with a CAGR of 19.7% projected over the forecast period. By 2033, the market is expected to attain a value of USD 5.94 billion, driven by the imperative need for scalable, high-performance data management platforms to support increasingly complex network infrastructures. The primary growth factors are the surge in network traffic, the proliferation of IoT devices, and the escalating demand for actionable network insights in real time.
A key driver behind the exponential growth of the Time-Series Database for Network Telemetry market is the unprecedented expansion of digital transformation initiatives across industries. Enterprises and service providers are generating massive volumes of telemetry data from network devices, applications, and endpoints. Traditional relational databases are ill-equipped to handle the high velocity and granularity of time-stamped data required for effective network telemetry. Time-series databases, purpose-built for this data type, enable organizations to ingest, process, and analyze millions of data points per second, facilitating proactive network management. The shift towards cloud-native architectures, edge computing, and the adoption of 5G networks further amplify the need for efficient telemetry data storage and analytics, reinforcing the critical role of time-series databases in modern network operations.
Another significant growth factor is the rising complexity of network environments, spurred by the advent of hybrid and multi-cloud deployments. As organizations embrace distributed infrastructures and software-defined networking, the challenge of monitoring, diagnosing, and optimizing network performance becomes more acute. Time-series databases for network telemetry empower IT teams with the ability to correlate historical and real-time data, detect anomalies, and automate fault management. This capability is particularly vital for sectors such as telecommunications, IT service providers, and large enterprises, where network downtime or performance degradation can have substantial financial and reputational repercussions. The integration of artificial intelligence and machine learning with time-series databases is also enabling advanced predictive analytics, further enhancing operational efficiency and network reliability.
The growing emphasis on network security and compliance is another pivotal factor fueling the adoption of time-series databases for network telemetry. With cyber threats becoming more sophisticated and regulatory requirements tightening, organizations must maintain comprehensive visibility into network activities and ensure rapid incident detection and response. Time-series databases provide the high-resolution data capture and retention necessary for security analytics, forensic investigations, and regulatory audits. As network telemetry evolves to encompass not only performance metrics but also security events and policy violations, the demand for scalable and secure time-series database solutions is expected to surge across both public and private sectors.
From a regional perspective, North America currently dominates the Time-Series Database for Network Telemetry market, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of major technology vendors, early adoption of advanced network management solutions, and substantial investments in digital infrastructure. However, the Asia Pacific region is poised for the fastest growth, with a projected CAGR of 22.4% through 2033, driven by rapid urbanization, expanding telecommunications networks, and increasing enterprise digitization. Europe and the Middle East & Africa are also witnessing steady growth, supported by government initiatives to modernize network infrastructure and enhance cybersecurity capabilities.
The Database Type segment of the Time-Series Database for Network Telemetry market is bifurcated into Open Source and Commercial solutions, each catering to distinct
===============================================================
States report information from two reporting populations: (1) the Served Population, which covers all youth receiving at least one independent living service paid for or provided by the Chafee Program agency, and (2) youth completing the NYTD Survey. States survey youth regarding six outcomes: financial self-sufficiency, experience with homelessness, educational attainment, positive connections with adults, high-risk behaviors, and access to health insurance. States collect outcomes information by conducting a survey of youth in foster care on or around their 17th birthday, also referred to as the baseline population. States will track these youth as they age and conduct a new outcome survey on or around the youth's 19th birthday, and again on or around the youth's 21st birthday, also referred to as the follow-up population. States will collect outcomes information on these older youth at ages 19 or 21 regardless of their foster care status or whether they are still receiving independent living services from the State. Depending on the size of the State's foster care youth population, some States may conduct a random sample of the baseline population of the 17-year-olds that participate in the outcomes survey so that they can follow a smaller group of youth as they age. All States will collect and report outcome information on a new baseline population cohort every three years.
Units of Response: Current and former youth in foster care
Type of Data: Survey
Tribal Data: No
Periodicity: Annual
Demographic Indicators: Ethnicity; Race; Sex
SORN: Not Applicable
Data Use Agreement: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm
Data Use Agreement Location: https://www.ndacan.acf.hhs.gov/datasets/order_forms/termsofuseagreement.pdf
Granularity: Individual
Spatial: United States
Geocoding: State
===============================================================
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Terrain and Obstacle Database market size reached USD 6.8 billion in 2024, reflecting a robust surge in demand across key sectors. The market is projected to expand at a CAGR of 10.7% from 2025 to 2033, with the total market value forecasted to hit USD 17.2 billion by 2033. This impressive growth is primarily fueled by the increasing adoption of advanced navigation systems, the proliferation of autonomous vehicles, and stringent regulatory mandates for safety in aviation and defense sectors.
The principal growth driver for the Terrain and Obstacle Database market is the rapid evolution and integration of digital mapping technologies within critical applications such as aviation, defense, and autonomous vehicles. As industries transition towards automation and real-time decision-making, the need for highly accurate, up-to-date terrain and obstacle data has become paramount. Modern aircraft, for example, require seamless access to global terrain and obstacle databases to enhance situational awareness, avoid potential hazards, and comply with international safety standards. Similarly, defense and military operations are increasingly dependent on these databases for mission planning, threat detection, and tactical navigation. The convergence of artificial intelligence, machine learning, and geospatial analytics is further accelerating the sophistication and utility of terrain and obstacle databases, making them indispensable for next-generation mobility and security solutions.
Another significant factor propelling the expansion of the Terrain and Obstacle Database market is the escalating emphasis on public safety and urban planning. With the proliferation of smart cities and the growing complexity of urban environments, municipal authorities and infrastructure planners are leveraging detailed terrain and obstacle data to optimize land use, enhance emergency response, and mitigate risks associated with natural disasters and urban expansion. The increasing deployment of drones for commercial, delivery, and surveillance applications also necessitates comprehensive databases to ensure safe navigation through densely populated or obstacle-rich environments. These trends are encouraging both public and private entities to invest in robust data acquisition, curation, and management solutions, thereby driving sustained market growth.
Furthermore, the surge in demand for real-time, cloud-based data solutions is reshaping the competitive dynamics of the Terrain and Obstacle Database market. Cloud deployment offers scalability, remote accessibility, and seamless updates, making it particularly attractive for global enterprises and government agencies managing large-scale operations. The integration of terrain and obstacle databases with IoT devices, 5G networks, and edge computing is enhancing the granularity and timeliness of data delivery, supporting critical applications such as autonomous vehicle navigation, disaster management, and precision agriculture. As regulatory frameworks continue to tighten and technology adoption accelerates, the market is poised for significant innovation and value creation over the next decade.
From a regional perspective, North America currently dominates the Terrain and Obstacle Database market, accounting for the largest revenue share in 2024. The region’s leadership is attributed to the presence of major aerospace, defense, and technology firms, as well as early adoption of advanced navigation and data management solutions. Europe and Asia Pacific are also witnessing substantial growth, driven by increasing investments in smart infrastructure, autonomous mobility, and national security initiatives. The Asia Pacific region, in particular, is expected to register the highest CAGR during the forecast period, fueled by rapid urbanization, expanding aviation sectors, and government-driven digital transformation projects.
The Component segment of the Terrain and Obstacle Database market comprises Database Software, Data Services, and Hardware, each playing a critical role in the value chain. Database software forms the backbone of the market, enabling users to store, retrieve, and analyze vast quantities of terrain and obstacle data with high precision. The demand for robust, scalable, and user-friendly da
===============================================================
According to our latest research, the global Digital Terrain Database market size in 2024 stands at USD 2.54 billion, with a robust year-on-year growth trajectory. The market is expected to expand at a CAGR of 9.2% from 2025 to 2033, reaching a forecasted value of USD 5.67 billion by 2033. This growth is primarily driven by the increasing adoption of advanced geospatial technologies across various sectors, including defense, civil engineering, and urban planning, as organizations seek to leverage high-precision terrain data for enhanced decision-making and operational efficiency.
The Digital Terrain Database market is experiencing significant momentum due to the rising demand for accurate topographical information in mission-critical applications. The integration of digital terrain data in aerospace and defense operations, such as flight simulation, mission planning, and navigation, is a key growth factor. These sectors require precise elevation models to ensure safety, optimize routes, and enhance situational awareness. Furthermore, the proliferation of unmanned aerial vehicles (UAVs) and autonomous systems has intensified the need for real-time, high-resolution terrain data, propelling the adoption of sophisticated digital terrain databases. As defense budgets continue to prioritize geospatial intelligence, the market is poised for sustained expansion.
Another pivotal growth driver for the Digital Terrain Database market is the rapid urbanization and infrastructure development observed globally. Civil engineering and urban planning sectors are increasingly relying on detailed terrain models for designing resilient infrastructure, mitigating natural hazards, and optimizing land use. The surge in smart city initiatives, particularly in emerging economies, necessitates the deployment of advanced geospatial solutions. Digital terrain databases enable planners and engineers to simulate various scenarios, assess environmental impacts, and streamline construction processes. The integration of terrain data with Building Information Modeling (BIM) and Geographic Information Systems (GIS) further amplifies its value, fostering market growth across public and private sectors.
Technological advancements and the growing accessibility of cloud-based geospatial solutions are also catalyzing market expansion. Cloud deployment models are democratizing access to high-quality terrain data, enabling organizations of all sizes to leverage these resources without significant upfront investments in hardware or infrastructure. The evolution of data acquisition methods, such as LiDAR, satellite imagery, and photogrammetry, has enhanced the accuracy and granularity of digital terrain databases. This, coupled with the increasing emphasis on environmental monitoring, disaster management, and agricultural optimization, is broadening the application landscape and stimulating demand for digital terrain databases across diverse verticals.
From a regional perspective, North America currently dominates the Digital Terrain Database market, attributed to the presence of leading technology providers, robust defense spending, and widespread adoption of geospatial technologies. Europe follows closely, driven by stringent regulatory frameworks and substantial investments in infrastructure modernization. The Asia Pacific region is anticipated to exhibit the fastest growth during the forecast period, fueled by rapid urbanization, government-led smart city projects, and expanding applications in agriculture and environmental monitoring. Latin America and the Middle East & Africa are also witnessing increased adoption, albeit from a lower base, as digital transformation initiatives gain traction across these regions.
The Digital Terrain Database market by component is segmented into Software, Hardware, and Services, each playing a vital role in the overall ecosystem. Software solutions form the backbone
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: In Brazil, studies that map electronic healthcare databases in order to assess their suitability for use in pharmacoepidemiologic research are lacking. We aimed to identify, catalogue, and characterize Brazilian data sources for Drug Utilization Research (DUR).
Methods: The present study is part of the project entitled "Publicly Available Data Sources for Drug Utilization Research in Latin American (LatAm) Countries." A network of Brazilian health experts was assembled to map secondary administrative data from healthcare organizations that might provide information related to medication use. A multi-phase approach including internet search of institutional government websites, traditional bibliographic databases, and experts' input was used for mapping the data sources. The reviewers searched, screened and selected the data sources independently; disagreements were resolved by consensus. Data sources were grouped into the following categories: 1) automated databases; 2) Electronic Medical Records (EMR); 3) national surveys or datasets; 4) adverse event reporting systems; and 5) others. Each data source was characterized by accessibility, geographic granularity, setting, type of data (aggregate or individual-level), and years of coverage. We also searched for publications related to each data source.
Results: A total of 62 data sources were identified and screened; 38 met the eligibility criteria for inclusion and were fully characterized. We grouped 23 (60%) as automated databases, four (11%) as adverse event reporting systems, four (11%) as EMRs, three (8%) as national surveys or datasets, and four (11%) as other types. Eighteen (47%) were classified as publicly and conveniently accessible online, providing information at the national level. Most offered more than 5 years of comprehensive data coverage and presented data at both the individual and aggregated levels. No information about population coverage was found. Drug coding is not uniform; each data source has its own coding system, depending on the purpose of the data. At least one scientific publication was found for each publicly available data source.
Conclusions: There are several types of data sources for DUR in Brazil, but a uniform system for drug classification and data quality evaluation does not exist. The extent of population covered by year is unknown. Our comprehensive and structured inventory reveals a need for full characterization of these data sources.
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over a total course of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types, from second-level to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data openly available to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
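For example, a minimal sketch (the file name below is illustrative; substitute one of the provided CSV files):

import pandas as pd

daily = pd.read_csv("fitbit_daily.csv")  # any of the daily/hourly CSV exports
print(daily.head())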
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit <path/to/fitbit.bson>
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema <path/to/sema.bson>
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys <path/to/surveys.bson>
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
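For example, with placeholder credentials (depending on your MongoDB setup you may also need --authenticationDatabase):
mongorestore --host localhost:27017 --username <user> --password <password> -d rais_anonymized -c fitbit <path/to/fitbit.bson>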
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
===============================================================
Our study analyzes the limitations of Bluetooth-based trace acquisition initiatives carried out until now in terms of granularity and reliability. We then go on to propose an optimal configuration for the acquisition of proximity traces and movement information using a fine-tuned Bluetooth system based on custom HW. With this system and based on such a configuration, we have carried out an intensive human trace acquisition experiment resulting in a proximity and mobility database of more than 5 million traces with a minimum granularity of 5 s.
Contact: josemari.cabero@tecnalia.com
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Origin: Samples were taken from customer taps. They were then analysed, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations:
Granularity: We decided to share individual results at the lowest level of granularity.
Anonymisation: It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
- Water Supply Zone (WSZ) - Limits interoperability with other datasets
- Postcode - Some postcodes contain very few households and may not offer necessary anonymisation
- Postal Sector - Deemed not granular enough in highly populated areas
- Rounded Co-ordinates - Not a recognised standard and may cause overlapping areas
- MSOA - Deemed not granular enough
- LSOA - Agreed as a recognised standard appropriate for England and Wales
- Data Zones - Agreed as a recognised standard appropriate for Scotland
Data Specifications:
- Each dataset will cover a calendar year of samples
- This dataset will be published annually
- The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate
Context: Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset. Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area. Some samples are tested on site and others are sent to scientific laboratories. Prior to undertaking analysis on any new instruments or utilising new analytical techniques, the laboratory undertakes validation of the equipment to ensure it continues to meet the regulatory requirements. This means that the limit of quantification may change for the method, either increasing or decreasing from the previous value. Any results below the limit of quantification will be reported as < with a number. For example, a limit of quantification change from <0.68 mg/l to <2.4 mg/l does not mean that there has been a deterioration in the quality of the water supplied.
Data Publishing Frequency: Annually
Supplementary information: Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset:
- Drinking Water Inspectorate Standards and Regulations
- Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics
- Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area (February 2024)
- Legislation history: Legislation - Drinking Water Inspectorate
- Information about lead pipes: Lead pipes and lead in your water - United Utilities
Dataset Schema:
- SAMPLE_ID: Identity of the sample
- SAMPLE_DATE: The date the sample was taken
- DETERMINAND: The determinand being measured
- DWI_CODE: The corresponding DWI code for the determinand
- UNITS: The expression of results
- OPERATOR: The measurement operator for limit of detection
- RESULT: The test results
- LSOA: Lower Super Output Area (population-weighted centroids used by the Office for National Statistics (ONS) for geo-anonymisation)
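As a sketch of how the OPERATOR column can be handled in analysis (the file name and the half-the-limit substitution are illustrative choices, not part of the dataset specification):

import pandas as pd

df = pd.read_csv("water_quality_samples.csv", parse_dates=["SAMPLE_DATE"])
df["value"] = pd.to_numeric(df["RESULT"], errors="coerce")
below_loq = df["OPERATOR"].eq("<")
# one common convention: substitute half the limit of quantification
df.loc[below_loq, "value"] = df.loc[below_loq, "value"] / 2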
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises spatial and temporal agricultural data compiled from the Farm Accountancy Data Network (FADN) Public Database available at the FADN-region level and further disaggregated using Corine Land Cover (CLC) information about agricultural area.
Processing to NUTS level:
CLC data layers were used to overlay what is defined as "Agricultural areas" in the CLC level 1 classification (class 2**) with FADN and NUTS regions. The overlay allows area-weighted shares to be calculated and, further, FADN farm weights to be allocated to the NUTS level. This allows the application of weights at the NUTS granularity. Please keep in mind that this is only possible under the assumption of homogeneous farms within each FADN region. A sketch of this allocation follows below.
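A minimal sketch of this area-weighted allocation, with illustrative region codes, areas, and weights (the real overlay comes from intersecting the CLC, FADN, and NUTS geometries):

import pandas as pd

overlay = pd.DataFrame({
    "fadn_region": ["F1", "F1", "F2"],
    "nuts_region": ["N11", "N12", "N12"],
    "agri_area":   [300.0, 100.0, 250.0],  # CLC class 2** area in each overlap
})
# share of each FADN region's agricultural area falling in each NUTS region
overlay["share"] = overlay["agri_area"] / overlay.groupby("fadn_region")["agri_area"].transform("sum")
fadn_weights = pd.Series({"F1": 1200.0, "F2": 800.0})  # farms represented per FADN region
overlay["nuts_weight"] = overlay["fadn_region"].map(fadn_weights) * overlay["share"]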
File description:
The dataset consists of eight files: one for each of the four levels of NUTS coding (NUTS 0-3) according to the 2016 NUTS specification, crossed with the two different sampling periods:
FADN data from 2004 onwards, standard results calculated for farms grouped according to EU typology of agricultural holdings based on standard output (SO).
FADN data from 1989 to 2009, standard results calculated for farms grouped according to EU typology of agricultural holdings based on standard gross margin (SGM).
For each csv file, the following columns are included:
Identifier:
Variables:
- Column 3, weighting: number of farms represented
- Columns 4 and onward: SE standard result variables; for a detailed description, see the accompanying file variable_description_zenodo.xlsx
Source information:
The raw data for the public Farm Accountancy Data Network (FADN) can be accessed through the official platform using the following link: FADN Public Database.
The CLC layers for the weighting of the spatial disaggregation can be accessed via the Copernicus homepage under the following link: https://land.copernicus.eu/en/products/corine-land-cover
This dataset has been created as part of the LAMASUS project under the scope of Deliverable 3.2, titled "Database on EU policies and payments for agriculture, forest, and other LUM related drivers". The data is directly linked to the work described on pages 50-57 (section 3.6, Public FADN Data). The full text of the deliverable can be accessed via: https://www.lamasus.eu/wp-content/uploads/LAMASUS_D3.2_policy-and-payment-database.pdf
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data set extracted from the official German topographic database ATKIS at the highest provided granularity, scale 1:25,000. It covers a rural area in the northwest of North Rhine-Westphalia containing 1,647 polygons of six different land-use classes and a road network comprising line features of different road types.
===============================================================
The Advocate Caller Application database includes information about each contact to the National Domestic Violence Hotline (The Hotline) or loveisrespect (LIR) helpline, made by telephone, chat, text, e-mail, or social media. This information is entered into the database manually by advocates at the time of contact. It is primarily used for service provision and operational purposes. It does not include any PII.
The Advocate Caller Application database includes demographic information about the person who called, chatted, texted, etc., and his/her situation (e.g., type of abuse), and information about what happened during the call, chat, or text (e.g., topics discussed, services provided, etc.). It also includes information about caller needs and reported barriers to receiving services.
Units of Response: Abuse Victims
Type of Data: Administrative
Tribal Data: Unavailable
COVID-19 Data: Unavailable
Periodicity: Unavailable
Data Use Agreement: https://www.icpsr.umich.edu/rpxlogin
Data Use Agreement Location: Unavailable
Equity Indicators: Unavailable
Granularity: Individual
Spatial: Unavailable
Geocoding: Unavailable
===============================================================
COVID-19 data at city/urban granularity, compiled on a monthly basis since May 2020. Due to changes in reporting, there are variations in the number of cities in each monthly update.
===============================================================
The US Consumer Household Database — Weekly Refreshed is AmeriList's premier consumer dataset, built for marketers, agencies, and enterprises that demand accurate, scalable, and timely U.S. consumer data. Covering over 200 million households nationwide and enriched with 200+ lifestyle, demographic, and behavioral attributes, this file is one of the most complete and frequently updated consumer databases available today.
Why Choose This Database?
Today's marketing success depends on reaching the right audience at the right time. With this dataset, you gain:
- Nationwide coverage of U.S. households (≈95%).
- Unmatched attribute depth including age, income, marital status, homeownership, and lifestyle interests.
- Freshness you can trust, with weekly updates to keep your campaigns aligned with real-world consumer changes.
- Multi-channel readiness, with delivery via CSV, API, SFTP, or cloud integrations (AWS, GCP, Azure).
Key Features
- 200M+ U.S. households for broad reach.
- 200+ attributes spanning demographics, lifestyle, purchase signals, and household composition.
- Household-level granularity with linkable fields for segmentation and modeling.
- Evaluation samples under NDA to test match rates and validate quality.
Use Cases This dataset powers a wide range of data-driven marketing strategies:
Industries That Benefit
Licensing & Access
The US Consumer Household Database is offered via 12-month subscription, with continuous weekly updates included. Evaluation samples are available under NDA. Flexible licensing models ensure it fits enterprises of all sizes.
Why AmeriList? For over 20 years, AmeriList has been a trusted leader in direct marketing data solutions. Our expertise in consumer databases, mailing lists, and CRM enrichment ensures not only the accuracy of the data but also the strategic value it delivers. With a focus on quality, compliance, and ROI, AmeriList helps brands and agencies unlock the full potential of consumer marketing.
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Water companies in the UK are responsible for testing the quality of drinking water. This dataset contains the results of samples taken from the taps in domestic households to make sure they meet the standards set out by UK and European legislation. This data shows the location, date, and measured levels of determinands set out by the Drinking Water Inspectorate (DWI).
Key Definitions
Aggregation
Process involving summarizing or grouping data to obtain a single or reduced set of information, often for analysis or reporting purposes
Anonymisation
Anonymised data is a type of information sanitization in which data anonymisation tools encrypt or remove personally identifiable information from datasets for the purpose of preserving a data subject's privacy
Dataset
Structured and organized collection of related elements, often stored digitally, used for analysis and interpretation in various fields.
Determinand
A constituent or property of drinking water which can be determined or estimated.
DWI
Drinking Water Inspectorate, an organisation “providing independent reassurance that water supplies in England and Wales are safe and drinking water quality is acceptable to consumers.”
DWI Determinands
Constituents or properties that are tested for when evaluating a sample for its quality as per the guidance of the DWI. For this dataset, only determinands with “point of compliance” as “customer taps” are included.
Granularity
Data granularity is a measure of the level of detail in a data structure. In time-series data, for example, the granularity of measurement might be based on intervals of years, months, weeks, days, or hours.
ID
Abbreviation for Identification that refers to any means of verifying the unique identifier assigned to each asset for the purposes of tracking, management, and maintenance.
LSOA
Lower Layer Super Output Areas are small geographic areas used for statistical and administrative purposes by the Office for National Statistics. They are designed to have homogeneous populations in terms of population size, making them suitable for statistical analysis and reporting. Each LSOA is built from groups of contiguous Output Areas with an average of about 1,500 residents or 650 households, allowing for granular data collection useful for analysis, planning and policy-making while ensuring privacy.
ONS
Office for National Statistics
Open Data Triage
The process carried out by a Data Custodian to determine if there is any evidence of sensitivities associated with Data Assets, their associated Metadata and Software Scripts used to process Data Assets if they are used as Open Data.
Sample
A sample is a representative segment or portion of water taken from a larger whole for the purpose of analysing or testing to ensure compliance with safety and quality standards.
Schema
Structure for organizing and handling data within a dataset, defining the attributes, their data types, and the relationships between different entities. It acts as a framework that ensures data integrity and consistency by specifying permissible data types and constraints for each attribute.
Units
Standard measurements used to quantify and compare different physical quantities.
Water Quality
The chemical, physical, biological, and radiological characteristics of water, typically in relation to its suitability for a specific purpose, such as drinking, swimming, or ecological health. It is determined by assessing a variety of parameters, including but not limited to pH, turbidity, microbial content, dissolved oxygen, presence of substances and temperature.
Data History
Data Origin
These samples were taken from customer taps. They were then analysed for water quality, and the results were uploaded to a database. This dataset is an extract from this database.
Data Triage Considerations
Granularity
Is it useful to share results as averages or as individual results?
We decided to share individual results, the lowest level of granularity.
Anonymisation
It is a requirement that this data cannot be used to identify a singular person or household. We discussed many options for aggregating the data to a specific geography to ensure this requirement is met. The following geographical aggregations were discussed:
- Water Supply Zone (WSZ) - Limits interoperability with other datasets
- Postcode - Some postcodes contain very few households and may not offer necessary anonymisation
- Postal Sector - Deemed not granular enough in highly populated areas
- Rounded Co-ordinates - Not a recognised standard and may cause overlapping areas
- MSOA - Deemed not granular enough
- LSOA - Agreed as a recognised standard appropriate for England and Wales
- Data Zones - Agreed as a recognised standard appropriate for Scotland
Data Specifications
Each dataset will cover a calendar year of samples
This dataset will be published annually
Historical datasets will be published as far back as 2016, from the introduction of The Water Supply (Water Quality) Regulations 2016
The Determinands included in the dataset are as per the list that is required to be reported to the Drinking Water Inspectorate.
Context
Many UK water companies provide a search tool on their websites where you can search for water quality in your area by postcode. The results of the search may identify the water supply zone that supplies the postcode searched. Water supply zones are not linked to LSOAs, which means the results may differ from this dataset.
Some sample results are influenced by internal plumbing and may not be representative of drinking water quality in the wider area.
Some samples are tested on site and others are sent to scientific laboratories.
Data Publishing Frequency
Annually
Data Triage Review Frequency
Annually unless otherwise requested
Supplementary information
Below is a curated selection of links for additional reading, which provide a deeper understanding of this dataset.
1. Drinking Water Inspectorate Standards and Regulations: https://www.dwi.gov.uk/drinking-water-standards-and-regulations/
2. LSOA (England and Wales) and Data Zone (Scotland):
3. Description for LSOA boundaries by the ONS: Census 2021 geographies - Office for National Statistics (ons.gov.uk)
4. Postcode to LSOA lookup tables: Postcode to 2021 Census Output Area to Lower Layer Super Output Area to Middle Layer Super Output Area to Local Authority District (August 2023) Lookup in the UK (statistics.gov.uk)
5. Legislation history: Legislation - Drinking Water Inspectorate (dwi.gov.uk)
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PUDL v2025.2.0 Data Release
This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.
One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
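For instance, a hedged sketch of reading one table straight from the public S3 bucket as Parquet (requires s3fs; the table name here is only an example, see the data dictionary and bucket listing for actual names and layout):

import pandas as pd

url = "s3://pudl.catalyst.coop/v2025.2.0/out_eia923__monthly_generation.parquet"
gen = pd.read_parquet(url, storage_options={"anon": True})  # anonymous public access
print(gen.head())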
Some potentially breaking changes to be aware of:
In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report, a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.
We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.
Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.
New Data
EIA 176
Added a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren't yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.
Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.
EIA 860
Added EIA 860 Multifuel table. See #3438 and #3946.
FERC 1
Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:
out_ferc1_yearly_detailed_income_statements
out_ferc1_yearly_detailed_balance_sheet_assets
out_ferc1_yearly_detailed_balance_sheet_liabilities
SEC Form 10-K Parent-Subsidiary Ownership
We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC's Form 10-K, Exhibit 21 "Subsidiaries of the Registrant". Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.
See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:
out_sec10k_parents_and_subsidiaries
core_sec10k_quarterly_filings
core_sec10k_quarterly_exhibit_21_company_ownership
core_sec10k_quarterly_company_information
Expanded Data Coverage
EPA CEMS
Added 2024 Q4 of CEMS data. See #4041 and #4052.
EPA CAMD EIA Crosswalk
In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.
The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs - some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issues #4039 and PR #4056 for a discussion of the changes observed in the course of this update.
EIA 860M
Added EIA 860m through December 2024. See #4038 and #4047.
EIA 923
Added EIA 923 monthly data through September 2024. See #4038 and #4047.
EIA Bulk Electricity Data
Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.
EIA 930
Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.
Bug Fixes
Fix an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.
Added preliminary data validation checks for several FERC 1 tables that were missing them. See #3860.
Fix spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.
Quality of Life Improvements
We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.
Other PUDL v2025.2.0 Resources
PUDL v2025.2.0 Data Dictionary
PUDL v2025.2.0 Documentation
PUDL in the AWS Open Data Registry
PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/
PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/
Zenodo archive of the PUDL GitHub repo for this release
PUDL v2025.2.0 release on GitHub
PUDL v2025.2.0 package in the Python Package Index (PyPI)
Contact Us
If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here are a bunch of different ways to get in touch:
Follow us on GitHub
Use the PUDL Github issue tracker to let us know about any bugs or data issues you encounter
GitHub Discussions is where we provide user support.
Watch our GitHub Project to see what we're working on.
Email us at hello@catalyst.coop for private communications.
On Mastodon: @CatalystCoop@mastodon.energy
On BlueSky: @catalyst.coop
On Twitter: @CatalystCoop
Connect with us on LinkedIn
Play with our data and notebooks on Kaggle
Combine our data with ML models on HuggingFace
Learn more about us on our website: https://catalyst.coop
Subscribe to our announcements list for email updates.
===============================================================
Custom license: https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/0HFSQZ
FIRMBACKBONE provides a longitudinal, cross-sectional (panel data) database that contains information about all Dutch organizations with legal status. It consists of the corporate register and provides financial information via annual reports, employment information, and information captured from organization websites. Depending on the source, different datasets are provided once or multiple times per year. This is the FIRMBACKBONE Corporate Register database on Dutch organizations and consists of all organizations that are registered at the Dutch Chambers of Commerce. The data was obtained on December 21, 2023 and on February 5, 2024. Organizations founded after this date, or organizations for which the entire bankruptcy process has ended and which have no other legal liabilities pending (e.g. contractual commitments), are not available in this database. The granularity level of this database is the establishment level. While available, using employment data from this dataset is discouraged since there is no legal obligation to keep this data up-to-date, especially since alternative databases with better quality and more detailed employment information are available in FIRMBACKBONE.
===============================================================
MealMe provides comprehensive grocery and retail SKU-level product data, including real-time pricing, from the top 100 retailers in the USA and Canada. Our proprietary technology ensures accurate and up-to-date insights, empowering businesses to excel in competitive intelligence, pricing strategies, and market analysis.
Retailers Covered: MealMe’s database includes detailed SKU-level data and pricing from leading grocery and retail chains such as Walmart, Target, Costco, Kroger, Safeway, Publix, Whole Foods, Aldi, ShopRite, BJ’s Wholesale Club, Sprouts Farmers Market, Albertsons, Ralphs, Pavilions, Gelson’s, Vons, Shaw’s, Metro, and many more. Our coverage spans the most influential retailers across North America, ensuring businesses have the insights needed to stay competitive in dynamic markets.
Key Features:
- SKU-Level Granularity: Access detailed product-level data, including product descriptions, categories, brands, and variations.
- Real-Time Pricing: Monitor current pricing trends across major retailers for comprehensive market comparisons.
- Regional Insights: Analyze geographic price variations and inventory availability to identify trends and opportunities.
- Customizable Solutions: Tailored data delivery options to meet the specific needs of your business or industry.
Use Cases:
- Competitive Intelligence: Gain visibility into pricing, product availability, and assortment strategies of top retailers like Walmart, Costco, and Target.
- Pricing Optimization: Use real-time data to create dynamic pricing models that respond to market conditions.
- Market Research: Identify trends, gaps, and consumer preferences by analyzing SKU-level data across leading retailers.
- Inventory Management: Streamline operations with accurate, real-time inventory availability.
- Retail Execution: Ensure on-shelf product availability and compliance with merchandising strategies.
Industries Benefiting from Our Data:
- CPG (Consumer Packaged Goods): Optimize product positioning, pricing, and distribution strategies.
- E-commerce Platforms: Enhance online catalogs with precise pricing and inventory information.
- Market Research Firms: Conduct detailed analyses to uncover industry trends and opportunities.
- Retailers: Benchmark against competitors like Kroger and Aldi to refine assortments and pricing.
- AI & Analytics Companies: Fuel predictive models and business intelligence with reliable SKU-level data.
Data Delivery and Integration: MealMe offers flexible integration options, including APIs and custom data exports, for seamless access to real-time data. Whether you need large-scale analysis or continuous updates, our solutions scale with your business needs.
Why Choose MealMe? Comprehensive Coverage: Data from the top 100 grocery and retail chains in North America, including Walmart, Target, and Costco. Real-Time Accuracy: Up-to-date pricing and product information ensures competitive edge. Customizable Insights: Tailored datasets align with your specific business objectives. Proven Expertise: Trusted by diverse industries for delivering actionable insights. MealMe empowers businesses to unlock their full potential with real-time, high-quality grocery and retail data. For more information or to schedule a demo, contact us today!
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of the population according to TAPSE/TRV ratio values.
===============================================================
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example of M. tuberculosis H37Rv with some so-called VapB.VapC TA pairs. We propose a more objective nomenclature of the TAS (toxin-antitoxin systems) based on the HMM profile clusters. Note that all VapCs shown here have a PIN Pfam annotation; however, their TASMANIA.Tn (Tn) is split into multiple sub-clusters, emphasizing the diversity of the PIN domains. In contrast, their associated so-called VapB-like antitoxins have very diverse Pfam annotations but consistent TASMANIA.An (An) clusters.