Purpose
This website represents an effort by the Augustus C. Long Health Sciences Library to identify all scholarly journal articles published by researchers affiliated with Columbia University Irving Medical Center (CUIMC) and make that data available to the CUIMC community.
Approach
- Search multiple biomedical databases for articles where at least one author was affiliated with CUIMC at the time of authorship.
- Import and deduplicate the results.
- Assign specific department, division, institute, and/or center affiliations to the article by automated processing of raw author affiliation data.
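Our approach is institution-specific: it starts from the affiliation data on each article rather than from a list of people. It therefore avoids the problem of distinguishing authors with similar names and of tracking faculty who are no longer affiliated with CUIMC.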
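The import and deduplication step can be illustrated with a minimal sketch. It assumes each imported record is a dictionary with hypothetical `pmid`, `doi`, and `source` fields and treats two records as the same article when they share a PubMed ID or a normalized DOI. The matching rules our system actually uses may differ.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common prefixes so variants compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def deduplicate(records):
    """Merge records that share a PMID or DOI; the first record seen wins."""
    merged = []
    index = {}  # maps ("pmid", value) or ("doi", value) to a position in merged
    for rec in records:
        keys = []
        if rec.get("pmid"):
            keys.append(("pmid", str(rec["pmid"])))
        doi = normalize_doi(rec.get("doi"))
        if doi:
            keys.append(("doi", doi))
        pos = next((index[k] for k in keys if k in index), None)
        if pos is None:
            pos = len(merged)
            merged.append({"pmid": rec.get("pmid"), "doi": doi, "sources": []})
        target = merged[pos]
        # Fill in identifiers the earlier record lacked.
        target["pmid"] = target["pmid"] or rec.get("pmid")
        target["doi"] = target["doi"] or doi
        if rec["source"] not in target["sources"]:
            target["sources"].append(rec["source"])
        for k in keys:
            index.setdefault(k, pos)
    return merged


# Hypothetical sample data: the first two rows describe the same article.
sample = [
    {"pmid": "1", "doi": "10.1000/ABC", "source": "PubMed"},
    {"pmid": None, "doi": "https://doi.org/10.1000/abc", "source": "Scopus"},
    {"pmid": "2", "doi": None, "source": "OpenAlex"},
]
unique = deduplicate(sample)
```

This single-pass version does not merge two already-separate records when a later record links them, and it cannot match records that lack both identifiers.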
Search queries
PubMed
To request records from PubMed, our system uses the NCBI Entrez Utilities, specifically the ESearch and EFetch APIs. The following query string is sent to the ESearch endpoint:
("Columbia University"[Affiliation]) AND ("Columbia University Medical"[Affiliation:~3] OR "Mailman Public Health"[Affiliation:~2] OR "Columbia University Presbyterian"[Affiliation:~5] OR "Columbia University Nursing"[Affiliation:~5] OR "Columbia University Medicine"[Affiliation:~6] OR "Columbia University Surgeons"[Affiliation:~7]) AND ("journal article"[Publication Type]) NOT ("preprint"[Publication Type])
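As a rough illustration, the sketch below builds an ESearch request URL for this query using only the Python standard library. The `db`, `term`, `retstart`, `retmax`, and `retmode` parameters are standard ESearch parameters; the function name, the page size, and the choice of JSON output are assumptions made for the example, not a description of our production code.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The PubMed query from this page, split across lines for readability.
QUERY = (
    '("Columbia University"[Affiliation]) AND '
    '("Columbia University Medical"[Affiliation:~3] OR '
    '"Mailman Public Health"[Affiliation:~2] OR '
    '"Columbia University Presbyterian"[Affiliation:~5] OR '
    '"Columbia University Nursing"[Affiliation:~5] OR '
    '"Columbia University Medicine"[Affiliation:~6] OR '
    '"Columbia University Surgeons"[Affiliation:~7]) AND '
    '("journal article"[Publication Type]) NOT '
    '("preprint"[Publication Type])'
)


def build_esearch_url(term, retstart=0, retmax=100):
    """Return an ESearch URL that asks for one page of matching PubMed IDs."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # the query string, URL-encoded by urlencode
        "retstart": retstart,  # offset of the first ID to return
        "retmax": retmax,      # number of IDs per page (assumed value)
        "retmode": "json",     # assumed output format for this sketch
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url(QUERY)
```

The IDs returned by ESearch would then be passed to EFetch to retrieve the full records.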
OpenAlex
To request OpenAlex records, our system uses the Works endpoint, filtering works by the institution identification number OpenAlex has assigned for CUIMC (https://openalex.org/I2799503643).
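A request of this kind might be built as in the following sketch. It assumes the `authorships.institutions.id` filter, cursor paging (`cursor=*` for the first page), and a `per-page` value of 200; the function name and these paging choices are illustrative and may not match what our system sends.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"
CUIMC_INSTITUTION_ID = "I2799503643"  # from https://openalex.org/I2799503643


def build_works_url(institution_id, cursor="*", per_page=200):
    """Return a Works URL filtered to one institution, for one page of results."""
    params = {
        # Keep works where any author lists this institution.
        "filter": f"authorships.institutions.id:{institution_id}",
        "per-page": per_page,  # assumed page size
        "cursor": cursor,      # "*" requests the first page
    }
    return OPENALEX_WORKS_URL + "?" + urlencode(params)


works_url = build_works_url(CUIMC_INSTITUTION_ID)
```

Each response page includes a cursor value for the next page, which would be passed back in as `cursor` until no further cursor is returned.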
Scopus
Scopus records are exported from the Scopus website in CSV format and then imported into the database. During record processing, the Scopus Abstract Retrieval API is used to add publication date data that is not available from the Scopus export. The following query is used to request records from Scopus:
(AF-ID(60027565) OR AF-ID(60005691) OR AF-ID(60001864) OR AF-ID(60026454) OR AF-ID(60014211) OR AF-ID(60012769) OR AF-ID(60025843) OR AF-ID(60011605)) AND (LIMIT-TO(DOCTYPE,"ar")) AND (LIMIT-TO(SRCTYPE,"j")) AND (LIMIT-TO(PUBSTAGE,"final"))
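Reading such an export could look like the sketch below. The column names `DOI`, `PubMed ID`, and `Author Keywords`, and the semicolon separator for keywords, are assumptions about the export layout; the output field names are hypothetical. Index keywords are deliberately not read.

```python
import csv
import io


def read_scopus_export(csv_file):
    """Turn rows of a Scopus CSV export into simple record dictionaries."""
    records = []
    for row in csv.DictReader(csv_file):
        raw_keywords = row.get("Author Keywords") or ""
        keywords = [k.strip() for k in raw_keywords.split(";") if k.strip()]
        records.append({
            "doi": row.get("DOI") or None,         # empty cells become None
            "pmid": row.get("PubMed ID") or None,
            "author_keywords": keywords,
            "source": "Scopus",
        })
    return records


# Hypothetical two-row export used only to exercise the function.
sample_csv = io.StringIO(
    "DOI,PubMed ID,Author Keywords\n"
    "10.1000/abc,12345,asthma; air pollution\n"
    "10.1000/def,,\n"
)
scopus_records = read_scopus_export(sample_csv)
```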
Web of Science
Records are exported from the Web of Science website as full records in CSV format and then imported into the database. The following query is used to request records from Web of Science:
(AD=(columbia univ SAME med ctr OR columbia univ SAME irving OR columbia univ SAME sch nursing OR columbia univ SAME mailman OR columbia univ SAME coll phys & surg OR columbia univ SAME dent sch OR columbia univ SAME coll dent med OR columbia univ SAME syst biol OR columbia univ SAME dept pathol OR columbia univ SAME dept psychiat OR columbia univ SAME dept neurol OR columbia univ SAME dept physiol OR columbia univ SAME dept surg OR taub inst OR vagelos coll OR columbia univ SAME biostat OR columbia univ SAME environm hlth OR columbia univ SAME naomi berrie OR columbia univ SAME dept anesthesiol OR columbia univ SAME dept med)) AND DT=(Article NOT Early Access NOT Book Chapter NOT Proceedings Paper)
Collected metadata
For each article record, our system collects the following information:
- PubMed ID
- DOI
- Journal of publication
- Raw author affiliation(s)
- Author list
- Subject terms
- Author-supplied keywords
- Publication date
- Data source
- Date the data was recorded
From the author list and raw author affiliations, our system generates:
- Affiliated authors list
- Columbia affiliations
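One way to picture a single record is the data class below. It is only a sketch: the field names and types are hypothetical and chosen to mirror the two lists above, not taken from our database schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ArticleRecord:
    # Collected from the data sources
    pmid: Optional[str]
    doi: Optional[str]
    journal: str
    raw_affiliations: List[str]
    authors: List[str]
    publication_date: Optional[date]
    sources: List[str]
    date_recorded: date
    subjects: List[str] = field(default_factory=list)
    author_keywords: List[str] = field(default_factory=list)
    # Generated from the author list and raw affiliations
    affiliated_authors: List[str] = field(default_factory=list)
    columbia_affiliations: List[str] = field(default_factory=list)


# Hypothetical example values.
example = ArticleRecord(
    pmid="12345",
    doi="10.1000/abc",
    journal="Example Journal",
    raw_affiliations=["Columbia University Irving Medical Center, New York, NY"],
    authors=["A. Author"],
    publication_date=date(2024, 1, 15),
    sources=["PubMed"],
    date_recorded=date(2024, 2, 1),
)
```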
Subjects and author keywords
Included in the article metadata are subjects and author-supplied keywords. Subjects are uniquely identifiable terms that come from a known vocabulary (often called a "controlled vocabulary"), such as PubMed's Medical Subject Headings (MeSH). Author keywords are those provided by the article's authors and are sometimes included as part of a structured abstract. If an article appears in more than one source, subject terms and author keywords from each source are combined in the resulting database record.
- OpenAlex records may contain MeSH headings and OpenAlex-specific Topic terms. Both are saved as subjects.
- PubMed records may contain MeSH headings and keywords. MeSH headings are saved as subjects and keywords as author keywords.
- Scopus records may contain author keywords.
Scopus records can also include index keywords, which may be MeSH headings or Emtree terms. However, exported Scopus data does not include any identifying information to distinguish MeSH headings from Emtree terms, so for the purpose of this project Scopus index keywords are ignored.
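The source-specific rules above can be sketched as a small merge function. The input field names (`mesh`, `topics`, `keywords`, `author_keywords`, `index_keywords`) are hypothetical; the routing of each field to subjects or author keywords follows the rules described in this section, and the lists are concatenated without deduplication.

```python
def merge_terms(source_records):
    """Combine subjects and author keywords from every source of one article."""
    subjects = []
    author_keywords = []
    for rec in source_records:
        source = rec["source"]
        if source == "OpenAlex":
            # MeSH headings and OpenAlex Topic terms are both saved as subjects.
            subjects += rec.get("mesh", []) + rec.get("topics", [])
        elif source == "PubMed":
            subjects += rec.get("mesh", [])
            author_keywords += rec.get("keywords", [])
        elif source == "Scopus":
            # Index keywords are ignored: MeSH and Emtree terms cannot be
            # told apart in the exported data.
            author_keywords += rec.get("author_keywords", [])
    return {"subjects": subjects, "author_keywords": author_keywords}


# Hypothetical records for one article found in three sources.
merged_terms = merge_terms([
    {"source": "PubMed", "mesh": ["COVID-19"], "keywords": ["pandemic"]},
    {"source": "OpenAlex", "mesh": ["COVID-19"], "topics": ["Coronavirus Disease 2019 Research"]},
    {"source": "Scopus", "author_keywords": ["SARS-CoV-2"], "index_keywords": ["coronavirus"]},
])
```

Because nothing is deduplicated, a MeSH heading supplied by both PubMed and OpenAlex appears twice in the combined subject list.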
Caveats and known issues
- Affiliated Authors: For each record, an attempt has been made to identify the article authors who were affiliated with CUIMC when the article was published. This is done by checking the raw affiliation string associated with each author for certain CUIMC-related keywords. Because some journals and submission systems group the affiliations of several authors together, some authors may be incorrectly identified as affiliated with CUIMC.
- Missing Affiliations: A certain number of articles (approximately 5% of all records) do not have a specific affiliation assigned. This occurs when a record did not include any specific affiliation information. In these cases, the raw affiliation was simply "Columbia University Medical Center" or similar.
- Searching for Authors: Authors have not been uniquely identified. While it is possible to search on author names within the database, a search for a common name will return records for every author with that or a similar name rather than for a single specific author.
- Publication Dates: Because of the way publication dates are given, the database contains some articles that were published earlier or later than the current date range of available records.
- Duplicated Subjects: We use the subjects provided directly by OpenAlex and PubMed and do not deduplicate the concatenated list. This may result in subjects that are very similar to each other nevertheless being listed separately (e.g., COVID-19, Coronavirus Disease 2019, Coronavirus Disease 2019 Research).
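The Affiliated Authors caveat is easier to see with a small example. The sketch below checks each author's raw affiliation strings against a short keyword list; the keywords, function names, and sample authors are all hypothetical and much simpler than the rules our system applies.

```python
# Hypothetical keyword list for illustration only.
CUIMC_KEYWORDS = (
    "irving medical center",
    "mailman school",
    "vagelos college",
    "school of nursing",
    "college of dental medicine",
)


def is_cuimc_affiliation(raw_affiliation):
    """Return True if the string mentions Columbia and a CUIMC-related keyword."""
    text = raw_affiliation.lower()
    return "columbia" in text and any(k in text for k in CUIMC_KEYWORDS)


def affiliated_authors(authors):
    """Return the names of authors with at least one matching affiliation string."""
    return [
        a["name"]
        for a in authors
        if any(is_cuimc_affiliation(s) for s in a["affiliations"])
    ]


sample_authors = [
    {"name": "A. Author",
     "affiliations": ["Department of Medicine, Columbia University Irving Medical Center, New York, NY"]},
    {"name": "B. Author",
     "affiliations": ["Department of Biology, Stanford University, Stanford, CA"]},
    # A grouped string: every institution on the paper is attached to this author.
    {"name": "C. Author",
     "affiliations": ["Stanford University, Stanford, CA; Columbia University Irving Medical Center, New York, NY"]},
]
matched = affiliated_authors(sample_authors)
```

Here `C. Author` is matched because the grouped affiliation string mentions CUIMC, even though that author may belong only to the other institution. This is the kind of false positive the caveat describes.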