Real-world assessment of healthcare provided by the National Health Service: The network of regional Beaver® research platforms

Real-world evidence can provide answers on healthcare utilization and appropriateness, post-marketing drug safety and comparative effectiveness, and the cost-effectiveness profiles of healthcare pathways. Healthcare utilization databases, possibly integrated with drug and disease registries, electronic medical records, and survey and cohort data (i.e. real-world data), make it possible to trace the healthcare ‘footprints’ left by beneficiaries of the National Health Service. Beaver® is a research platform, available on demand to Italian regions, which we developed for computing indicators of healthcare utilization and clinical outcomes, as well as for generating evidence on effectiveness and cost-effectiveness profiles. Two distinct solutions may be adopted. First, the so-called Beaver® Light front-end automatically computes health indicators of adherence to official guidelines. Second, the so-called Beaver® Full front-end involves an eight-step procedure entirely driven by the study protocol. In order to fulfil the directives recently issued by the European Parliament and Council and by the Italian Authority for the protection of individual data, the platform resides in each region’s infrastructure, thereby limiting the free movement of electronic health data. Indeed, regional authorities should be responsible for data safety and for allowing data accessibility. Because standardized and validated algorithms are applied on regional platforms containing data extracted with a standardized procedure, the resulting regional estimates may be compared and possibly summarized using common meta-analytic techniques. In conclusion, the Beaver® regional platform is a promising tool which may help stimulate healthcare research in Italy.

Beaver® is powered, at a minimum, by data accounting for the healthcare services delivered by the Italian regions and the associated expenditures, which, since the early 2000s, the National Government has required regional governments to collect. The resulting regional databases, the so-called healthcare utilization (HCU) or administrative databases, collect a variety of information, including: (i) demographic and administrative data on residents who receive National Health Service (NHS) assistance (NHS beneficiaries practically coinciding with the whole resident population); (ii) hospital discharge records providing information on primary diagnosis, co-existing conditions and procedures for inpatients admitted to public and private hospitals, coded according to the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) classification system; (iii) emergency room accesses, providing ICD-9-CM codes of the causes of access to general and specialized emergency/acceptance departments of public and private hospitals; (iv) drugs dispensed by territorial pharmacies and medicaments directly administered in the outpatient and day-hospital settings, coded according to the Anatomical Therapeutic Chemical (ATC) classification system; (v) data on outpatient services, including specialist visits, laboratory tests and diagnostic imaging (e.g., X-rays, Computerised Tomography-CT, Magnetic Resonance Imaging-MRI, Positron Emission Tomography-PET), coded according to the National Tariff Nomenclator; (vi) Certificates of Delivery Assistance (the so-called CeDAP), providing information on the mother's socioeconomic traits, as well as medical information on pregnancy, childbirth, and child presentation at delivery; (vii) other data, including those regarding vaccinations, mental health services, and any other form of assistance included in the list of the Essential Assistance Levels (LEA) [14] provided by the NHS. As a unique identification code is used for all databases within each region, their record linkage allows the diagnostic-therapeutic pathway supplied to NHS beneficiaries to be traced.
The above-described data represent the minimum set available in each region, which alone justifies the implementation of a regional platform. However, other data may be integrated within a given Beaver® platform, depending on the availability of relevant data and on specific regional concerns. For example, it is possible to feed a Beaver® regional platform with other institutional sources (e.g., drug and therapeutic plan registries instituted by the Drug Agency -AIFA-, the national registry of rare diseases instituted by the Health Institute -ISS-, health surveys and population census instituted by the Central Institute of Statistics -ISTAT-), as well as non-institutional ones (e.g., population-based cancer registries, medical records from primary healthcare and specialist clinics), provided that data are recorded using the same identification code already used for recording NHS administrative data (e.g., the national fiscal code).
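The record linkage described above can be illustrated with a minimal Python sketch that groups records from separate administrative sources under a shared identification code. The field names (`id`, `atc`, `icd9cm`), the toy records and the helper `link_by_id` are hypothetical and are not part of Beaver®.

```python
from collections import defaultdict

# Toy extracts from three administrative sources (all values invented).
demographics = [{"id": "A1", "birth_year": 1950, "sex": "F"},
                {"id": "B2", "birth_year": 1972, "sex": "M"}]
drug_dispensings = [{"id": "A1", "date": "2016-02-01", "atc": "A10BA02"},
                    {"id": "A1", "date": "2016-05-03", "atc": "A10BA02"}]
hospital_discharges = [{"id": "A1", "date": "2016-09-10", "icd9cm": "250.00"},
                       {"id": "B2", "date": "2016-03-15", "icd9cm": "410.91"}]


def link_by_id(sources):
    """Group the records of every source under the shared identification code.

    Hypothetical helper: it only shows the principle of deterministic linkage.
    """
    pathways = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            pathways[record["id"]][source_name].append(record)
    return pathways


pathways = link_by_id({
    "demographics": demographics,
    "drugs": drug_dispensings,
    "hospital": hospital_discharges,
})

# Print, for each person, how many records each source contributed.
for person_id, by_source in sorted(pathways.items()):
    counts = {name: len(rows) for name, rows in by_source.items()}
    print(person_id, counts)
```

Any additional source (a registry, a survey) could be linked in the same way, as long as it carries the same identification code.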

PRIVACY AND OTHER ETHICAL ISSUES
In order to ensure a high level of protection of natural persons, the rights and rules ratified by the European Parliament and Council in the recently issued Directive 95/46/EC [15] represented the milestone for designing and managing Beaver®. In particular, because there "… are circumstances under which it may be reasonable and economical ensuring data protection to be broader than a single project…", a common processing platform may be established by a public authority (in our case the regional government), thereby creating an environment where data may be processed "… in a manner that ensures appropriate security and confidentiality of the personal data, including for preventing unauthorized access ...".
Some true and false constraints established by the EU Directive deserve to be emphasized. In particular, the directive states that "… the processing of special categories of personal data may be necessary for reasons of public interest in the areas of public health without consent of the data subject …", where "… all elements related to … health status ... and the determinants having an effect on that health status, healthcare needs, resources allocated to healthcare, the provision of, and universal access to, healthcare as well as healthcare expenditure and financing, and the causes of mortality…" are here intended in the context of public health. In other terms, this covers all those elements which have been listed in the Background section of the current paper and which justify the development of the Beaver® research platform. In addition, the Italian Authority for the protection of individual data established that, although informed consent to the processing of personal data must always be collected when it is possible to provide adequate information to the subjects included, the impossibility of informing them does not preclude the processing of the data themselves [16]. This occurs when, for example, the large sample size (as in the case of the population residing in a region), as well as the long time window between data collection (such as the therapy and/or outcome occurrence) and processing, make it impossible to obtain consent. Moreover, obtaining consent only from the survivors of the treatment and/or the outcome makes it impossible to generate any valid evidence.
High-quality knowledge that can provide the basis for improving healthcare for many people is expected from real-world data. For this reason, we believe that the true ethical challenge is not to protect institutional HCU data to the point of making them inaccessible, but rather to make them usable for generating solid evidence useful for better care of future patients. This raises questions such as "Who can access these data, and under what conditions?". Justified constraints, which we hope will be homogeneously adopted by all the regional administrations, should be the following:
• public or non-profit agencies external to the regional administration (e.g., academia or other research institutions), able to autonomously process data derived from secondary sources and to generate credible evidence (according to scientific experience developed in this field), should be included in a list of agencies accredited for submitting research protocols requiring access to the platform;
• a detailed protocol including at least the "genuine" research question (addressing an issue of interest for regional and/or national health policies), the methodology to be adopted (including methods to take into account the sources of systematic vulnerability, so important in the context under consideration) and legal details about the ownership of the results and the commitment to publish them, should be submitted to obtain permission to process the data stored in the regional platform;
• data processing, once authorized, should be performed without transferring data from the platform that holds them, but with specific queries based on and driven by the protocol.

BEAVER ® ARCHITECTURE AND FUNCTIONING
The above-mentioned general principles drove the design of the Beaver® platform and led to a two-layer architecture: the platform administration layer and the remote user layer (Figure 1). For technical characteristics, the reader can refer to Table 1.
The administrator, who is in charge of the first layer, is the physical person authorized by the Regional Authority to access the regional environment where healthcare data are stored and protected. Besides setting up and managing Beaver® in a dedicated, fully secured system within the regional environment, the administrator has the tasks of (i) extracting data relating to specific fields of interest, (ii) harmonizing them according to defined protocols, and (iii) allocating each of them to a dedicated database within the Beaver® environment. Two types of databases may be installed within the platform.
The first type, which we call field-specific database (FSD), concerns a specific diagnostic-therapeutic area and only involves extracting, from each regional administrative database, the information that is common to all regions. For example, the "diabetes FSD" is built with a two-step data extraction procedure. In the first step, the NHS beneficiaries who leave 'footprints' suggestive of diabetes through specific services provided in a defined time window (i.e., at least one antidiabetic drug prescription, one hospital admission with a primary or secondary diagnosis of diabetes, and/or a co-payment exemption for diabetes) are identified. In the second step, their identification code is the key for retrieving the services provided to patients likely affected by diabetes, as recorded in the administrative archives. FSDs covering the areas of oncology, cardiovascular and respiratory diseases, mental disorders, and pregnancy-child health are designed and built with analogous procedures. Data extract-transform-load (the so-called ETL procedure) for each FSD is designed uniformly across all the Italian regions participating in the Beaver® network. Differences in the depth of the time window covered by administrative recording may depend on data availability.
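The two-step extraction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy archives, the function names, and the selection codes (ATC prefix `A10` for antidiabetic drugs, ICD-9-CM prefix `250` for diabetes) are chosen for the example; the source does not state which codes or data structures Beaver® actually uses.

```python
# Toy administrative archives (all records invented for illustration).
drug_dispensings = [
    {"id": "A1", "date": "2016-02-01", "atc": "A10BA02"},   # metformin
    {"id": "C3", "date": "2016-04-11", "atc": "C09AA05"},   # ramipril
]
hospital_discharges = [
    {"id": "B2", "date": "2016-06-20", "diagnoses": ["410.91", "250.00"]},
    {"id": "C3", "date": "2016-07-02", "diagnoses": ["428.0"]},
]
exemptions = [{"id": "D4", "date": "2016-01-15", "condition": "diabetes"}]
outpatient_services = [
    {"id": "A1", "date": "2016-03-01", "service": "HbA1c test"},
    {"id": "C3", "date": "2016-03-09", "service": "ECG"},
]


def in_window(record, start, end):
    # ISO-formatted dates compare correctly as plain strings.
    return start <= record["date"] <= end


def identify_diabetes_cohort(start, end):
    """Step 1: ids with at least one diabetes 'footprint' in the time window.

    The code prefixes below are assumptions made for this sketch.
    """
    ids = set()
    ids |= {r["id"] for r in drug_dispensings
            if in_window(r, start, end) and r["atc"].startswith("A10")}
    ids |= {r["id"] for r in hospital_discharges
            if in_window(r, start, end)
            and any(code.startswith("250") for code in r["diagnoses"])}
    ids |= {r["id"] for r in exemptions
            if in_window(r, start, end) and r["condition"] == "diabetes"}
    return ids


def extract_services(ids):
    """Step 2: retrieve every recorded service of the identified ids."""
    archives = {"drugs": drug_dispensings, "hospital": hospital_discharges,
                "exemptions": exemptions, "outpatient": outpatient_services}
    return {name: [r for r in rows if r["id"] in ids]
            for name, rows in archives.items()}


cohort = identify_diabetes_cohort("2016-01-01", "2016-12-31")
fsd = extract_services(cohort)
print(sorted(cohort))
```

In this toy example the person `C3` leaves no diabetes footprint, so none of that person's records enters the field-specific extract.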
The second type of database, which we call mixed-source database (MSD), covers specific data available in a given region. Experiences of linking administrative data with information from population-based sampling surveys (e.g., the health examination survey managed for Italy by the National Institute of Health) and hospital-based disease registries (e.g., the cancer registry of the National Cancer Institute of Milan) are ongoing in Lombardy. Analogously to the FSD, the identification code of the subjects included in the survey/registry (first step) serves for linking with administrative data (second step).
The second layer of Beaver® is devoted to the users, i.e. the physical persons belonging to either the regional administration or an accredited agency who obtained the credentials for accessing a database (FSD or MSD) after the Regional Authority approved the protocol. Two distinct solutions may be adopted.
The first solution (the so-called Beaver® Light front-end, available to regional administration personnel only) automatically computes the set of process and outcome indicators defined by the Health Ministry. The indicators are those reported in the official manual for LEA monitoring through the assessment of pathways experienced by NHS beneficiaries suffering from chronic diseases (e.g., diabetes, heart failure, chronic obstructive pulmonary disease, breast, colon or rectum cancer, selected mental disorders, ...), experiencing acute episodes (e.g., myocardial infarction, haemorrhagic stroke, ischaemic stroke, ...), or going through a physiologic experience (e.g., pregnancy). By choosing a given FSD (e.g., the diabetes FSD) and a reference year (e.g., 2016), a standard report is generated containing the size and rates of the prevalent cohort (e.g., all diabetic patients), the incident cohort (e.g., patients newly taken in care for diabetes), process indicators (e.g., prevalent cohort members who adhered to selected recommendations, such as assessments of glycated haemoglobin, lipid profile, urine albumin excretion, serum creatinine and dilated eye exams) and outcome indicators (e.g., incident cohort members who experienced at least one hospital admission for short-term diabetes complications, uncontrolled diabetes, long-term vascular outcomes, and non-traumatic lower limb amputation). Findings stratified by gender, age class, and possibly geographical area of residence (for example local health unit, if present) may be obtained using Beaver® Light.
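A stratified process indicator of the kind listed above can be sketched in Python as the share of prevalent cohort members who meet a recommendation within each stratum. The cohort, the field names and the adherence threshold `MIN_TESTS` are hypothetical; the actual definitions are those of the Health Ministry manual, which is not reproduced here.

```python
from collections import defaultdict

# Toy prevalent cohort for one reference year (values invented).
prevalent_cohort = [
    {"id": "A1", "sex": "F", "age_class": "65+",   "hba1c_tests": 2},
    {"id": "B2", "sex": "M", "age_class": "40-64", "hba1c_tests": 0},
    {"id": "D4", "sex": "F", "age_class": "65+",   "hba1c_tests": 1},
    {"id": "E5", "sex": "M", "age_class": "65+",   "hba1c_tests": 3},
]

MIN_TESTS = 2  # hypothetical adherence threshold, not taken from the LEA manual


def adherence_by(cohort, stratum_key):
    """Share of cohort members meeting the threshold, within each stratum."""
    totals = defaultdict(int)
    adherent = defaultdict(int)
    for member in cohort:
        stratum = member[stratum_key]
        totals[stratum] += 1
        if member["hba1c_tests"] >= MIN_TESTS:
            adherent[stratum] += 1
    return {s: adherent[s] / totals[s] for s in totals}


by_sex = adherence_by(prevalent_cohort, "sex")
by_age = adherence_by(prevalent_cohort, "age_class")
print(by_sex)
print(by_age)
```

The same function would serve for stratification by area of residence, given a corresponding field in the cohort records.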
The second solution (the so-called Beaver® Full front-end, available to every authorized user) involves an eight-step procedure entirely driven by the protocol approved by the Regional Authority. An easy-to-use interface, articulated in a sequence of queries and drop-down menus for specifying (i) the FSD (or MSD) from which data must be processed, (ii) inclusion and (iii) exclusion criteria, and the definitions of (iv) exposure(s), (v) covariates, (vi) outcome(s) and (vii) follow-up, allows the user to choose time window(s), demographics (age and gender), diagnostics (ICD-9-CM codes), therapeutics (ATC codes), outpatient services (National Tariff Nomenclator), and other data useful for carrying out the study. The output of these first seven steps is a master table, still not accessible to users, which may be used for data analysis (i.e., the eighth and last step of the sequence). Usually, data analysis is carried out through another sequence of queries specifying the variables of interest (possibly transformed with respect to their original form) to be used for obtaining descriptive statistics, as well as for fitting selected models (e.g., logit, log-binomial, Poisson, or Cox models). However, in order to make Beaver® Full as flexible as possible and to apply other, more tailored models and algorithms, data may be processed directly through R or SAS™ (the latter provided that the licence is available in the regional environment). Finally, a report including the results of data processing, in the form of lists, tables and figures, is returned to the user, provided that the regional administrator approves it.

DISCUSSION
In light of the increasing demand for low-cost real-world healthcare data and evidence, the current paper describes the new opportunity arising from the Beaver® regional research platform, a web-based system for integrating and processing healthcare data.
Beaver® has several strengths. First, because the platform was designed and built by means of grants from institutional public agencies (please see the Acknowledgments section), it is free of charge to the regions concerned. The contractual form defining intellectual property and the rules for installation, updating and maintenance of Beaver® is currently being studied, but it is beyond the competence of the authors of this paper.
Second, the rules established by the new European regulation for the protection of natural persons in relation to the processing of personal data [15], which limit the free movement of electronic health data, are fully complied with. In other terms, because each Italian region must be considered the owner of the data on healthcare provided to its NHS beneficiaries, data should be stored and processed within a secured regional environment, and their movement should be limited to few and exceptional needs (such as those of the Health Ministry or legal questions).
Being a web platform, Beaver® ships with three modern front-ends written in HyperText Markup Language (HTML), JavaScript™ (JS) and Cascading Style Sheets (CSS): Beaver® Light, Beaver® Full and an administrative application. Beaver® Full provides a set of sophisticated input controls designed to make it easy to follow a protocol and conduct a study, while Beaver® Light fully automates those processes.

Security
The three Beaver® front-ends provide a classical authentication system based on a username and a password. Beaver® never exposes records and/or tables to the final users and is designed to prevent any Structured Query Language (SQL) injection attack.
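One common way to prevent SQL injection is to bind user-supplied values as query parameters instead of concatenating them into the SQL text. The sketch below illustrates the principle with Python's standard `sqlite3` module as a stand-in; the table, the function `count_by_atc` and the data are hypothetical, and the source does not describe how Beaver® (whose engine is written in PHP on PostgreSQL) implements this protection.

```python
import sqlite3

# In-memory stand-in database with invented records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dispensing (id TEXT, atc TEXT)")
conn.executemany("INSERT INTO dispensing VALUES (?, ?)",
                 [("A1", "A10BA02"), ("B2", "C09AA05")])


def count_by_atc(connection, atc_code):
    """Return an aggregate count only.

    The user value is bound as a parameter (the `?` placeholder) and is
    never concatenated into the SQL text.
    """
    row = connection.execute(
        "SELECT COUNT(*) FROM dispensing WHERE atc = ?", (atc_code,)
    ).fetchone()
    return row[0]


legit = count_by_atc(conn, "A10BA02")
# A classic injection payload is treated as a plain, non-matching string.
attack = count_by_atc(conn, "x' OR '1'='1")
print(legit, attack)
```

Because the function returns only a count, the example also reflects the stated principle that individual records are never exposed to the final user.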

Engine
The core of Beaver® is written in the "PHP: Hypertext Preprocessor" (PHP) language and is programmed for managing multiple users, receiving and decoding instructions, preparing the transactions for the data manipulation jobs, and launching and monitoring the jobs throughout their life cycle.

Database
Regional data and system data are stored in a local PostgreSQL cluster, a well-known, robust and high-performance open source Relational Database Management System (RDBMS).

ETL (Extract-Transform-Load)
From the administrative front-end it is possible to populate the database using the automated ETL functionalities developed specifically for Beaver® in the Python™ language. The ETL component uses Apache Spark™ and is designed to take into account each region's data availability and the divergences in data structures and encodings. It connects to the regional data source, extracts the data, then cleans, normalizes and loads them into the local database. The loaded data undergo an optimization process in the RDBMS environment and are made accessible from Beaver® Light and Beaver® Full.
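The extract, clean, normalize and load sequence described above can be sketched with the Python standard library alone. This is a simplified illustration and not the Beaver® ETL component: it uses neither Apache Spark™ nor PostgreSQL, and the export layout, the column names and the three helper functions are assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical regional export: semicolon-separated, mixed case, DD/MM/YYYY dates.
RAW_EXPORT = """id;sex;dispensing_date;atc
a1;f;01/02/2016;a10ba02
B2;M;15/03/2016;C09AA05
;M;20/03/2016;C09AA05
"""


def extract(text):
    """Read the raw export into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def transform(rows):
    """Drop rows lacking the identification code and normalize the encodings."""
    cleaned = []
    for row in rows:
        if not row["id"].strip():
            continue  # a record without identification code cannot be linked
        day, month, year = row["dispensing_date"].split("/")
        cleaned.append((row["id"].strip().upper(),
                        row["sex"].strip().upper(),
                        f"{year}-{month}-{day}",       # ISO date
                        row["atc"].strip().upper()))
    return cleaned


def load(connection, rows):
    """Insert the harmonized rows into the local (stand-in) database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS dispensing "
        "(id TEXT, sex TEXT, date TEXT, atc TEXT)")
    connection.executemany("INSERT INTO dispensing VALUES (?, ?, ?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_EXPORT)))
loaded = conn.execute("SELECT * FROM dispensing ORDER BY id").fetchall()
print(loaded)
```

Keeping the region-specific handling inside `transform` means that the loaded tables share one layout whatever the format of the regional source.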

Database transactions
When users start an elaboration, the engine produces one or more SQL transactions. These are composed of several highly optimized queries executed in sequential order. Some queries are proven to work on mutually exclusive relations and can therefore be parallelized. Beaver® provides an innovative module that easily distributes specific queries, even though the adopted RDBMS (PostgreSQL) is supposed to work only with non-parallel queries. Furthermore, some queries, especially those involved in time-dependent analyses, are sent to Spark™ in order to maximize overall performance and are therefore written in SparkSQL. Beaver® interacts with Spark™ through a Scala module written specifically for this purpose.

Statistical analysis and reports
In order to run analyses and generate reports, the engine instructs the database to produce meaningful data. This process is very simple for certain models and becomes more complex for time-dependent analyses. R is then called and receives instructions about what kind of analysis needs to be run on the data previously produced. The results are stored in the database as JavaScript™ Object Notation (JSON) strings, while the GUI provides a human-readable version of them. Users can generate a fully printable report as an Adobe® Portable Document Format (PDF) file that uses a standard A4 paper layout.
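Storing an analysis result as a JSON string and rendering it in human-readable form can be sketched as follows. The result structure, the table `analysis_result` and the numbers are invented for illustration, and SQLite again stands in for the platform database.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis_result (job_id INTEGER PRIMARY KEY, payload TEXT)")

# Hypothetical output of a fitted model (numbers invented).
result = {"model": "cox", "covariate": "adherence",
          "hazard_ratio": 0.82, "ci95": [0.74, 0.91]}

# Serialize the result to a JSON string and store it with the job identifier.
conn.execute("INSERT INTO analysis_result (job_id, payload) VALUES (?, ?)",
             (1, json.dumps(result)))

# A presentation layer would read the string back and render it for the user.
payload = conn.execute(
    "SELECT payload FROM analysis_result WHERE job_id = ?", (1,)).fetchone()[0]
restored = json.loads(payload)
summary = (f"{restored['model']}: HR {restored['hazard_ratio']:.2f} "
           f"(95% CI {restored['ci95'][0]:.2f}-{restored['ci95'][1]:.2f})")
print(summary)
```

Only aggregate quantities appear in the stored payload, which fits the principle that users see results rather than records.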

External Tools (Apache Zeppelin, R, SAS™, etc.)
Advanced users with knowledge of the underlying Beaver® data structure can operate using external tools in order to run custom and non-standard analyses. The most versatile and powerful tool is Apache Zeppelin, which provides a front-end for interacting with several interpreters used in data analysis (Python™, R, generic JDBC SQL drivers, Scala, Apache Spark™, Apache Hive™, Apache Pig, etc.). Assuming licence availability and direct access to the Beaver® database, advanced users can adopt any commercial statistical software such as SAS™ or Stata®.

Batch execution and error handling
A section of the Beaver® platform allows users to monitor the execution status of jobs in real time. The engine catches errors whenever they occur and logs the activity of each job. When a job execution ends, the user has full access to its detailed logs, to the SQL transactions generated by the engine, and to profiling information that can help reveal bottlenecks and system performance degradation.
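The behaviour described above (catching errors, logging each step, and keeping profiling information) can be sketched in Python as follows. The function `run_job`, the job record layout and the step names are hypothetical and do not reflect the actual PHP engine.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("jobs")


def run_job(name, steps):
    """Run the steps in order, logging each one and catching any failure."""
    record = {"name": name, "status": "running", "timings": {}, "error": None}
    for step_name, step in steps:
        started = time.perf_counter()
        try:
            step()
        except Exception as exc:  # record the error instead of crashing
            record["status"] = "failed"
            record["error"] = f"{step_name}: {exc}"
            log.error("job %s failed at %s: %s", name, step_name, exc)
            return record
        # Per-step duration: simple profiling data for spotting bottlenecks.
        record["timings"][step_name] = time.perf_counter() - started
        log.info("job %s finished step %s", name, step_name)
    record["status"] = "completed"
    return record


def failing_step():
    raise ValueError("relation not found")


ok = run_job("demo-ok", [("select", lambda: None), ("aggregate", lambda: None)])
bad = run_job("demo-bad", [("select", lambda: None), ("aggregate", failing_step)])
print(ok["status"], bad["status"], bad["error"])
```

The returned record is the kind of object a monitoring page could poll to display the job status and, after completion, the per-step timings.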

Accordingly, Beaver® was designed in order to (i) process data locally, (ii) prevent regional data from leaving the platform that holds them and (iii) allow users to see only the aggregate results of the analysis, without any possibility of overriding the citizens' right to privacy. Third, because evidence appropriately addressing knowledge-based policy, improving the effectiveness of healthcare and the efficiency of health services, is expected from processing real-world data, Beaver® has been designed as a technological tool (i.e., the Beaver® platform) bound to the rules for scientific research with which the Beaver® network is required to comply. For example, the platform implements algorithms for automatically computing adherence to recommended clinical examinations of NHS beneficiaries suffering from selected diseases and conditions. It should be emphasized that the algorithms implemented in the platform were designed according to the Health Ministry official manual for LEA indicators, the latter being validated for their relationship with measurable clinical outcomes. Besides monitoring healthcare, the Beaver® platform has been designed to allow users external to the regional administration to generate solid evidence. However, physical persons may be authorized to access the platform only if (i) they belong to a public agency (e.g., academia or other research institutions), (ii) they have documented experience in generating evidence from secondary data, and (iii) a detailed study protocol complying with good practice of epidemiological research using secondary data [17][18][19][20] has been submitted to and approved by the Regional Authority. In other words, the reports which the Beaver® platform is able to produce, i.e., both the standard report with official data on process and outcome indicators of LEA monitoring and the report expressly built for answering a specific research question, are generated in accordance with a predefined, approved protocol.
Finally, as the regions where the platform is up and running are connected to the Beaver® network, and a common protocol always drives data processing, the aggregated data generated within each region may be compared and/or summarized by means of meta-analytic techniques [21,22]. This is very important because assessments of between-region healthcare homogeneity (equity), as well as summarized evidence of healthcare implications, may be obtained with comparable data and methods.
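As an illustration of how regional aggregate estimates could be summarized, the Python sketch below pools region-specific log hazard ratios with inverse-variance (fixed-effect) weights and with the DerSimonian-Laird random-effects estimator. These are common meta-analytic choices assumed for the example; the source does not state which techniques are used, and the regional estimates are invented.

```python
import math

# Hypothetical regional log hazard ratios and standard errors (numbers invented).
regional = {"Region A": (-0.20, 0.10), "Region B": (-0.10, 0.08),
            "Region C": (-0.35, 0.15)}


def pool(estimates):
    """Inverse-variance fixed-effect and DerSimonian-Laird random-effects pooling."""
    y = [est for est, _ in estimates.values()]
    w = [1.0 / se ** 2 for _, se in estimates.values()]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q and the method-of-moments between-region variance tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)
    # Random-effects weights add tau^2 to each within-region variance.
    w_re = [1.0 / (1.0 / wi + tau2) for wi in w]
    random_ = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_random = math.sqrt(1.0 / sum(w_re))
    return {"fixed": fixed, "random": random_, "se_random": se_random,
            "tau2": tau2, "Q": q}


pooled = pool(regional)
low = math.exp(pooled["random"] - 1.96 * pooled["se_random"])
high = math.exp(pooled["random"] + 1.96 * pooled["se_random"])
print(f"pooled HR {math.exp(pooled['random']):.2f} (95% CI {low:.2f}-{high:.2f})")
```

Cochran's Q and the between-region variance returned by `pool` also give a first indication of how homogeneous the regional estimates are.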

CONCLUSIONS AND FURTHER RESEARCH
In conclusion, the Beaver® regional research platform is a promising tool for stimulating healthcare research in Italy. It is currently implemented, or its implementation is ongoing, in some regions (Lombardy, Sicily, Sardinia, Friuli Venezia Giulia, Marche and Abruzzo). A cross-validation study involving regions other than those reported above (Molise and Tuscany) is currently being designed.
We propose two priority directions for further research [13]. First, there is an urgent need to 'unlock' more detailed data concerning health surveys, drug and disease registries, and large-scale, disease-agnostic DNA and biological collections, thereby expanding, and making sustainable, the range of record linkages in order to deliver precision medicine and to innovate the discovery of new drug targets or the repurposing of existing drugs. Second, there should be a major expansion in the underlying methods and in the applications of innovative approaches using real-world data to design and carry out real-world studies. This is important for delivering learning health systems and patient benefit.
The recently established Center of Healthcare Research and Pharmacoepidemiology [http://www.chrp.it/], a consortium of sixteen public universities (to which seven other universities are being added) working in close cooperation with public institutions (e.g., the National Institute of Health), many regions, research and treatment institutes, and scientific societies, has as its main mission to promote and spread methods and experiences on the appropriate, correct and safe use of real-world data for generating robust evidence useful for addressing health policy.

FIGURE 1. General architecture for the regional Beaver® research platform

TABLE 1. Architecture component descriptions, implementation and operability purpose