taylorfrancis.json no longer scrapes the PDF: readcube pop-up related? #54

@rossmounce

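I suspect this is caused by the recent move of T&F content to ReadCube. The PDF download button now opens a pop-up where you choose between the real PDF and the ReadCube version, which would explain why the fulltext_pdf selector (//a[text()='PDF']) returns no results in the log below. I tried to fix the scraper myself but failed.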

$ quickscrape --url http://dx.doi.org/10.1017/s1477201903001093  --scraper journal-scrapers/scrapers/taylorfrancis.json --output tandf -l verbose
info: quickscrape 0.4.7 launched with...
info: - URL: http://dx.doi.org/10.1017/s1477201903001093
info: - Scraper: /home/ross/Downloads/pica/journal-scrapers/scrapers/taylorfrancis.json
info: - Rate limit: 3 per minute
info: - Log level: verbose
info: urls to scrape: 1
info: processing URL: http://dx.doi.org/10.1017/s1477201903001093
debug: info [scraper]. URL rendered. http://www.tandfonline.com/doi/abs/10.1017/S1477201903001093.
debug: data [scraper]. element captured. publisher.  Taylor & Francis Group .
debug: debug [scraper]. element results. publisher.  Taylor & Francis Group .
debug: data [scraper]. element captured. journal_name. Journal of Systematic Palaeontology.
debug: debug [scraper]. element results. journal_name. Journal of Systematic Palaeontology.
debug: data [scraper]. element capture failed. volume.
debug: debug [scraper]. selector had no results. //*[@id='unit2']/div[1]/div/div/table/tbody/tr/td[1]/h3/a[1]. volume.
debug: debug [scraper]. element results. volume. .
debug: data [scraper]. element capture failed. issue.
debug: debug [scraper]. selector had no results. //*[@id='unit2']/div[1]/div/div/table/tbody/tr/td[1]/h3/a[2]. issue.
debug: debug [scraper]. element results. issue. .
debug: data [scraper]. element captured. title. Osteology and systematic position of the eocene primobucconidae (aves, coraciiformes sensu stricto), with first records from Europe.
debug: debug [scraper]. element results. title. Osteology and systematic position of the eocene primobucconidae (aves, coraciiformes sensu stricto), with first records from Europe.
debug: data [scraper]. element captured. keywords.
debug: debug [scraper]. element results. keywords. .
debug: data [scraper]. element captured. author_name.  Gerald   Mayr .
debug: data [scraper]. element captured. author_name.  Cecile   Mourer‐Chauviré .
debug: data [scraper]. element captured. author_name.  Ilka   Weidig .
debug: debug [scraper]. element results. author_name.  Gerald   Mayr , Cecile   Mourer‐Chauviré , Ilka   Weidig .
debug: data [scraper]. element captured. date_published.
debug: debug [scraper]. element results. date_published. .
debug: data [scraper]. element captured. doi. 9512127.
debug: data [scraper]. element captured. doi. 10.1017/S1477201903001093.
debug: data [scraper]. element captured. doi. Journal of Systematic Palaeontology, Vol. 2, No. 1, 2004, pp. 1-12.
debug: debug [scraper]. element results. doi. 9512127,10.1017/S1477201903001093,Journal of Systematic Palaeontology, Vol. 2, No. 1, 2004, pp. 1-12.
debug: data [scraper]. element capture failed. csv1.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][1]. csv1.
debug: debug [scraper]. element results. csv1. .
debug: data [scraper]. element capture failed. csv2.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][2]. csv2.
debug: debug [scraper]. element results. csv2. .
debug: data [scraper]. element capture failed. csv3.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][3]. csv3.
debug: debug [scraper]. element results. csv3. .
debug: data [scraper]. element capture failed. csv4.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][4]. csv4.
debug: debug [scraper]. element results. csv4. .
debug: data [scraper]. element capture failed. csv5.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][5]. csv5.
debug: debug [scraper]. element results. csv5. .
debug: data [scraper]. element capture failed. csv6.
debug: debug [scraper]. selector had no results. //a[@id='CSVdownloadButton'][6]. csv6.
debug: debug [scraper]. element results. csv6. .
debug: data [scraper]. element captured. fulltext_html. http://dx.doi.org/10.1017/S1477201903001093.
debug: debug [scraper]. element results. fulltext_html. http://dx.doi.org/10.1017/S1477201903001093.
debug: data [scraper]. element capture failed. fulltext_pdf.
debug: debug [scraper]. selector had no results. //a[text()='PDF']. fulltext_pdf.
debug: debug [scraper]. element results. fulltext_pdf. .
debug: info [scraper]. download started. fulltext.html.
info: URL processed: captured 8/17 elements (9 captures failed)
debug: writing results to file: results.json
debug: changing back to top-level directory
info: all tasks completed
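
For triage, the nine failed captures (volume, issue, csv1 to csv6, and fulltext_pdf) can be pulled out of a verbose log like the one above with a short script. This is only a sketch, not part of quickscrape: the function name `failed_selectors` and the regular expression are hypothetical, and the pattern assumes the "selector had no results. <xpath>. <element>." line format stays exactly as it appears in this log.

```python
import re

# Assumed line format, copied from the verbose log in this issue:
#   debug: debug [scraper]. selector had no results. <xpath>. <element>.
NO_RESULTS = re.compile(
    r"selector had no results\. (?P<selector>.+)\. (?P<element>\w+)\.\s*$"
)


def failed_selectors(log_text):
    """Return {element_name: xpath} for every selector that matched nothing."""
    failures = {}
    for line in log_text.splitlines():
        match = NO_RESULTS.search(line)
        if match:
            failures[match.group("element")] = match.group("selector")
    return failures


# Three lines taken verbatim from the log above; only the first two are failures.
sample = """\
debug: debug [scraper]. selector had no results. //*[@id='unit2']/div[1]/div/div/table/tbody/tr/td[1]/h3/a[1]. volume.
debug: debug [scraper]. selector had no results. //a[text()='PDF']. fulltext_pdf.
debug: data [scraper]. element captured. doi. 10.1017/S1477201903001093.
"""

for element, selector in failed_selectors(sample).items():
    print(f"{element}: {selector}")
```

Listing the failures this way separates the selectors that probably never applied to this page (the csv download buttons) from the ones that look like regressions (volume, issue, fulltext_pdf).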
