Which links are archived?
mandy edited this page 4 years ago

Written 02 Jan 2019, information here might be outdated

Partially updated on 19 August 2019. Check the source code to verify.

Edit: The bot was down from 11 October to 15 December 2019. LMW resumed on 15 December; LMA resumed on [TBD].

Three important parts to this question:

  1. What links is this bot able to recognise?
  2. When can this fail?
  3. Ignored links

What links is this bot able to recognise?

At this point in time, the bot scrapes the following activity feeds:

The bot recognises links to the following websites:

  • YouTube
    • youtube.com
    • youtu.be
  • Vimeo
    • vimeo.com
  • Dailymotion
    • dailymotion.com
    • dai.ly
  • Soundcloud
    • soundcloud.com

The regex used to recognise links can be found here.

Well-tested formats: (example ID is italic)

  • ... youtube.com/watch?v=video_ID-01
  • ... youtu.be/video_ID-02
  • ... dailymotion.com/video/x3video
  • ... vimeo.com/313370004
  • ... youtube.com/playlist?list=PLthirtyTwoCharacterPlaylist-ID_05
  • ... PLthirtyTwoCharacterPlaylist-ID_06
  • ... youtube.com/channel/UCaLongChannelID_-aBC007
  • ... youtube.com/user/channelName
  • ... youtube.com/c/channelName
  • ... youtube.com/channelName

Other formats that have appeared successfully include:

  • ... soundcloud.com/artist-name/track-url
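
To make the formats above concrete, here is a minimal sketch of a matcher for them. It is not the bot's actual regex (see the link above for that); the pattern, the name LINK_RE and the helper find_links are hypothetical and only cover the URL shapes listed on this page. Bare playlist IDs are not handled.

```python
import re

# Hypothetical pattern, NOT the bot's real regex: it only covers the hosts
# and URL shapes listed on this page.
LINK_RE = re.compile(
    r"""(?:https?://)?(?:www\.|m\.)?
    (?:
        youtube\.com/(?:watch\?v=|playlist\?list=|channel/|user/|c/)?[\w-]+
      | youtu\.be/[\w-]+
      | vimeo\.com/\d+
      | dailymotion\.com/video/\w+
      | dai\.ly/\w+
      | soundcloud\.com/[\w-]+/[\w-]+
    )""",
    re.VERBOSE | re.IGNORECASE,
)


def find_links(text):
    """Return every recognised media link found in text, in order."""
    return [m.group(0) for m in LINK_RE.finditer(text)]
```

Because youtube.com/channelName has no distinguishing prefix, a pattern like this will also match unrelated paths on youtube.com. A link split over multiple lines would be cut off at the line break, which is one way the failures described below can arise.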

When can this fail?

There may theoretically be links present on these feeds which the bot does not find. Known reasons for this are:

  • Missed wikia changes:
    • Excessive wikia activity: the bot currently fetches the last 100 changes from the last 30 days of the Lost Media Archive wikia's activity feed. If more than 100 changes to the wikia occur between two scans (scans are currently a minimum of 15 minutes apart), some of these changes will not be seen by the bot.
    • Excessive load: once a recent activity scan begins, the next scan does not start until every link parsed in the current scan has either been downloaded and uploaded to archive.org, or has failed. If a large number of long videos are found, it may be hours or even days between scans. If more than 100 changes to the wikia occur between two scans, some of these changes will not be seen by the bot. A current mitigation is to not automatically archive playlists with more than 50 videos. However, automatically calculating the total filesize of a playlist would be a better solution (10 one-hour videos take far longer to upload than 50 one-minute videos).
    • Both of these can be mitigated by decoupling fetching from uploading. This can be implemented as a producer-consumer system. This change is considered a priority.
  • Unrecognised links: if a link does not point to one of the above sites, it may not be recognised by the bot. A link may also be missed if it is poorly or unusually formatted, for example if it is split over multiple lines. The regex used to recognise links on the Lost Media Archive wikia can be found here.
  • External failure: an external issue such as an electrical power failure or an unforeseen fatal issue (such as running out of disk space) may temporarily stop this bot from working. This could result in many wikia changes being missed if the issue cannot be solved quickly. Additionally, when YouTube or Wikia/Fandom change how their sites work, it can cause issues with this bot (usually resolved with a youtube-dl update).
  • Owner Neglect: lol
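
The producer-consumer idea mentioned above could look roughly like the sketch below. It is not the bot's code: run_pipeline, scans and archive are hypothetical names, and archive stands in for whatever downloads a link and uploads it to archive.org. The point is only that scanning puts links on a queue and returns immediately, while worker threads do the slow uploads.

```python
import queue
import threading


def run_pipeline(scans, archive, workers=2):
    """Fetch links (producer) and archive them (consumers) concurrently.

    scans:   iterable of lists of links, one list per feed scan (hypothetical).
    archive: callable that downloads and uploads one link (hypothetical).
    Returns the links that were archived without raising an exception.
    """
    jobs = queue.Queue()
    done = []
    lock = threading.Lock()

    def consumer():
        while True:
            link = jobs.get()
            if link is None:  # sentinel: no more work for this worker
                return
            try:
                archive(link)
            except Exception:
                continue  # a failed upload must not stall the other jobs
            with lock:
                done.append(link)

    threads = [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()

    # Producer: queueing is cheap, so scanning never waits for an upload.
    seen = set()
    for links in scans:
        for link in links:
            if link not in seen:
                seen.add(link)
                jobs.put(link)

    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return done
```

In a long-running bot the producer loop would presumably sleep between scans rather than iterate over a fixed list, and the queue would need to survive restarts; both are left out here.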

At this point (May 2019), most errors are caused by YouTube changing their site or by archive.org internal errors.

Apart from the 'unrecognised links' and 'external failure' issues, these failures can be resolved by fetching a longer list of changes to the wiki and parsing all the links on those pages, so long as none of the links have already been taken down by YouTube etc. Implemented as of this commit.
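
One way to decide when such a longer list is needed is to check whether the current scan still overlaps the previous one. The sketch below is an assumption, not what the linked commit does: scan_may_have_gap is a hypothetical helper, and it assumes each change carries an integer ID that increases over time.

```python
def scan_may_have_gap(fetched_ids, last_seen_id, limit=100):
    """Return True if changes may have been missed since the previous scan.

    fetched_ids:  change IDs from the current scan (hypothetical integer IDs
                  that increase over time).
    last_seen_id: newest change ID seen in the previous scan, or None on the
                  first run.
    limit:        maximum number of changes one scan can return.
    """
    # A scan that returned fewer than `limit` changes cannot have been cut off.
    if last_seen_id is None or len(fetched_ids) < limit:
        return False
    # The window is full; if even its oldest entry is newer than the last
    # change seen before, something in between may have fallen outside it.
    return min(fetched_ids) > last_seen_id
```

The check is deliberately conservative: it can raise a false alarm when the window happens to start exactly at the first unseen change, which only costs an unnecessary longer fetch.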

Ignored links

Playlists with more videos than max_playlist (currently 50) are ignored.
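
As a rough sketch of this rule, the function below applies the video-count limit and, optionally, a limit on total running time. Only max_playlist and its value of 50 come from this page; should_archive_playlist, video_durations and max_total_seconds are hypothetical, and running time is used here merely as an illustrative stand-in for a filesize check.

```python
def should_archive_playlist(video_durations, max_playlist=50, max_total_seconds=None):
    """Decide whether a playlist is small enough to archive automatically.

    video_durations:   one duration in seconds per video (hypothetical input).
    max_playlist:      playlists with more videos than this are ignored.
    max_total_seconds: optional budget on total running time; an illustrative
                       stand-in for a filesize check, not an existing setting.
    """
    if len(video_durations) > max_playlist:
        return False
    if max_total_seconds is not None and sum(video_durations) > max_total_seconds:
        return False
    return True
```

With a five-hour budget, 10 one-hour videos would be skipped while 50 one-minute videos would still be archived, even though the second playlist is five times longer by count.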